I don't have time to test this, and I could be wrong, but it
seems to me if you're going down the pre-processing route,
you could just entity-escape the leading & of these
entities, and then your XML parser should reconstruct the
original entity.
Something like s/&#/&amp;#/g , or if &amp; isn't
defined, s/&#/&#38;#/g on your original document
should do the trick.
It seems that might be a little easier than the
post-processing route you're experimenting with now.
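
Here is a rough sketch of that substitution as a standalone helper. The name escape_char_refs and the sample string are made up for illustration; nothing here comes from XML::DOM itself. It only covers numeric character references (the ones starting with &#), and it would also rewrite any that sit inside comments or CDATA sections, which may or may not matter for your files.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::DOM): escape the leading '&' of
# every numeric character reference, so the parser sees '&amp;#233;' and
# returns the literal text '&#233;' rather than the resolved character.
sub escape_char_refs {
    my ($xml) = @_;
    $xml =~ s/&#/&amp;#/g;    # named entities such as &amp; are left alone
    return $xml;
}

# Made-up sample input with a decimal and a hexadecimal reference.
my $original = '<name>Andr&#233; &amp; Ren&#xE9;e</name>';
my $escaped  = escape_char_refs($original);

print "$escaped\n";
# prints: <name>Andr&amp;#233; &amp; Ren&amp;#xE9;e</name>
```

If that works for your data, you would read the file into a string, run it through the helper, and hand the result to the parser's parse method instead of calling parsefile. Running the helper twice does no harm, because the escaped text no longer contains "&#".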
Forrest Cahoon
not speaking for merrill corporation
> -----Original Message-----
> From: perl-xml-bounces@listserv.ActiveState.com
> [mailto:perl-xml-bounces@listserv.ActiveState.com] On Behalf
> Of Michael Boudreau
> Sent: Thursday, July 12, 2007 11:15 AM
> To: perl-xml@listserv.ActiveState.com
> Subject: Can I prevent XML::DOM::Parser from resolving
> character entities?
>
> Hi all,
>
> I'm using XML::DOM to parse a file that includes common
> character entities such as '&#233;' (lowercase e with acute
> accent). I'd like to be able to pluck bits of text from this
> document and preserve the character entity, rather than
> having it resolved into 'é'.
>
> I've seen this question in the archives of this list, but the
> threads don't include a solution, or else I've misunderstood
> something. Can anybody point me in the right direction?
>
> [As a last resort, I may pre-process the XML file (turn
> '&#233;' into something like '{}') and then reverse
> the substitution on the other side. But I'd rather persuade
> XML::DOM::Parser to be a little less helpful.]
>
>
> use XML::DOM;
>
> my $dom_parser = new XML::DOM::Parser;
>
> my $doc = $dom_parser->parsefile( "$xml_file" );
>
> my $text = get_unique_element_content($doc, 'elementname');
>
> sub get_unique_element_content {
>
>     my ($doc, $tagname) = (@_);
>     my $content = '';
>
>     my $nodelist = $doc->getElementsByTagName($tagname);
>     my $firstnode = $nodelist->item(0) or return $content;
>     my @children = $firstnode->getChildNodes;
>     foreach my $node ( @children ) {
>         $content = $node->getNodeValue if $node->getNodeType eq TEXT_NODE;
>     }
>     return $content; ## $content has é instead of &#233; :-(
> }
>
>
> --
> Michael R. Boudreau
> Senior Publishing Technology Analyst
> The University of Chicago Press
> 1427 E. 60th Street
> Chicago, IL 60637
> (773) 753-3298 fax: (773) 753-3383
>
>
> _______________________________________________
> Perl-XML mailing list
> Perl-XML@listserv.ActiveState.com
> To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
>