
Thread: Can I prevent XML::DOM::Parser from resolving character entities?




Can I prevent XML::DOM::Parser from resolving character entities?
United States
2007-07-12 11:14:51
Hi all,

I'm using XML::DOM to parse a file that includes common character entities
such as '&#233;' (lowercase e with acute accent). I'd like to be able to
pluck bits of text from this document and preserve the character entity,
rather than having it resolved into 'é'.

I've seen this question in the archives of this list, but the threads don't
include a solution, or else I've misunderstood something. Can anybody point
me in the right direction?

[As a last resort, I may pre-process the XML file (turn '&#233;' into
something like '{}') and then reverse the substitution on the other
side. But I'd rather persuade XML::DOM::Parser to be a little less helpful.]


use XML::DOM;

my $dom_parser = new XML::DOM::Parser;

my $doc = $dom_parser->parsefile( "$xml_file" );

my $text = get_unique_element_content($doc, 'elementname');

sub get_unique_element_content {

   my ($doc, $tagname) = (@_);
   my $content = '';

   my $nodelist = $doc->getElementsByTagName($tagname);
   my $firstnode = $nodelist->item(0) or return $content;
   my @children = $firstnode->getChildNodes;
   foreach my $node ( @children ) {
      $content = $node->getNodeValue if $node->getNodeType eq TEXT_NODE;
   }
   return $content;  ## $content has é instead of &#233;  :-(
}


-- 
Michael R. Boudreau
Senior Publishing Technology Analyst
The University of Chicago Press
1427 E. 60th Street
Chicago, IL 60637
(773) 753-3298    fax: (773) 753-3383
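
The last-resort idea described above (swap each character reference for a placeholder before parsing, then swap it back afterwards) could look roughly like the following. This is only a sketch: the helper names and the '{{#233}}' placeholder format are assumptions made for illustration, not something the post specifies, and it assumes the document contains no literal '{{#...}}' text of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: turn &#233; (or &#xE9;) into {{#233}} ({{#xE9}})
# so the parser sees ordinary text instead of a character reference.
sub hide_char_refs {
    my ($xml) = @_;
    $xml =~ s/&(#(?:\d+|x[0-9A-Fa-f]+));/'{{' . $1 . '}}'/ge;
    return $xml;
}

# Hypothetical helper: reverse the substitution on extracted text.
sub restore_char_refs {
    my ($text) = @_;
    $text =~ s/\{\{(#(?:\d+|x[0-9A-Fa-f]+))\}\}/&$1;/g;
    return $text;
}

my $hidden = hide_char_refs('<name>Caf&#233;</name>');
print "$hidden\n";                        # <name>Caf{{#233}}</name>
print restore_char_refs($hidden), "\n";   # <name>Caf&#233;</name>
```

The hidden string would be handed to the parser in place of the original file contents, and restore_char_refs would be applied to whatever text is pulled out of the DOM.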


_______________________________________________
Perl-XML mailing list
Perl-XML@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
Re: Can I prevent XML::DOM::Parser from resolving character entities?
New Zealand
2007-07-12 16:02:50
On Thu, 2007-07-12 at 11:14 -0500, Michael Boudreau wrote:
> Hi all,
> 
> I'm using XML::DOM to parse a file that includes common character entities
> such as '&#233;' (lowercase e with acute accent). I'd like to be able to
> pluck bits of text from this document and preserve the character entity,
> rather than having it resolved into 'é'.

I can't comment on how to disable essential functionality of your chosen
XML parser, but the Perl-XML FAQ offers a regex for converting non-ASCII
characters to numeric character entities:

  http://perl-xml.sourceforge.net/faq/#numeric_char_ent

Cheers
Grant
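
For readers who can't reach the FAQ, the post-processing approach Grant describes can be sketched as below. This is a minimal illustration of the idea, not a copy of the FAQ entry; the helper name is an assumption, and it assumes the text coming out of the DOM is a decoded Perl character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with a
# decimal numeric character reference such as &#233;.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $content = "caf\x{E9}";                 # "café" as a character string
print to_numeric_entities($content), "\n"; # prints caf&#233;
```

Note that this converts every non-ASCII character, including any that were typed literally in the source file, so it cannot tell which ones were originally written as entities.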

_______________________________________________
Perl-XML mailing list
Perl-XMLlistserv.ActiveState.com
To unsubscribe: http:/
/listserv.ActiveState.com/mailman/mysubs
RE: Can I prevent XML::DOM::Parser from resolving character entities?
United States
2007-07-13 08:09:05
I don't have time to test this, and I could be wrong, but it
seems to me if you're going down the pre-processing route,
you could just entity-escape the leading & of these
entities, and then your XML parser should reconstruct the
original entity.

Something like s/&#/&amp;#/g , or if &amp; isn't
defined, s/&#/&#38;#/g on your original document
should do the trick.

It seems that might be a little easier than the
post-processing route you're experimenting with now.

Forrest Cahoon
not speaking for merrill corporation
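
Spelled out, the pre-processing Forrest suggests might look like the sketch below. The helper name is an assumption, and the pattern is slightly stricter than a bare s/&#/.../ in that it only matches complete numeric character references; named entities such as &amp; are left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the '&' that starts a numeric character
# reference, so the parser yields the literal text "&#233;" instead of 'é'.
sub protect_char_refs {
    my ($xml) = @_;
    $xml =~ s/&(#(?:\d+|x[0-9A-Fa-f]+);)/&amp;$1/g;
    return $xml;
}

my $raw = '<name>Caf&#233; &amp; bar</name>';
print protect_char_refs($raw), "\n";
# prints <name>Caf&amp;#233; &amp; bar</name>
```

The protected string would then be parsed with $dom_parser->parse($string) instead of parsefile. One caveat: if the text is later written back out as XML, the '&' will normally be escaped again, so this suits extracting text better than round-tripping a document.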
