Thread: Re: Can I prevent XML::DOM::Parser from resolving character entities?




Re: Can I prevent XML::DOM::Parser from resolving character entities?
United States
2007-07-12 19:55:01
Thanks! That does what I need, except...

My experience doesn't quite match what the FAQ says to expect.
Using Perl 5.6.0:

   use utf8;
   s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;

   Produces:

   In XML input:     Output after regex:
   &#x2122;      =>  &#8482;   [trademark symbol]
   &#x00E9;      =>  &#233;    [lowercase e with acute accent]


   use utf8;  # [note the FAQ says this is not required with 5.6]
   s/([^\x20-\x7F])/'&#' . ord($1) . ';'/gse;

   Produces:

   In XML input:     Output after regex:
   &#x2122;      =>  &#8482;
   &#x00E9;      =>  &#233;

   But leaving out 'use utf8'; and still using the second regex:

   In XML input:     Output after regex:
   &#x2122;      =>  &#226;&#132;&#162;
   &#x00E9;      =>  &#195;&#169;



On 7/12/07 4:02 PM, "Grant McLean" <grantmclean.net.nz> wrote:

> On Thu, 2007-07-12 at 11:14 -0500, Michael Boudreau wrote:
> > Hi all,
> >
> > I'm using XML::DOM to parse a file that includes common character
> > entities such as '&#x00E9;' (lowercase e with acute accent). I'd like
> > to be able to pluck bits of text from this document and preserve the
> > character entity, rather than having it resolved into 'é'.
>
> I can't comment on how to disable essential functionality of your chosen
> XML parser, but the Perl-XML FAQ offers a regex for converting non-ASCII
> characters to numeric character entities:
>
> http://perl-xml.sourceforge.net/faq/#numeric_char_ent
>
> Cheers
> Grant


-- 
Michael R. Boudreau
Senior Publishing Technology Analyst
The University of Chicago Press
1427 E. 60th Street
Chicago, IL 60637
(773) 753-3298    fax: (773) 753-3383



_______________________________________________
Perl-XML mailing list
Perl-XMLlistserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

Re: Can I prevent XML::DOM::Parser from resolving character entities?
New Zealand
2007-07-12 20:31:33
On Thu, 2007-07-12 at 19:55 -0500, Michael Boudreau wrote:
> Thanks! That does what I need, except...
> 
> My experience doesn't quite match what the FAQ says to expect.
> Using Perl 5.6.0:
> 
>    use utf8;
>    s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;
> 
>    Produces:
> 
>    In XML input:     Output after regex:
>    &#x2122;      =>  &#8482;   [trademark symbol]
>    &#x00E9;      =>  &#233;    [lowercase e with acute accent]
> 
> 
>    use utf8;  # [note the FAQ says this is not required with 5.6]

It's not required with 5.8.  It is required with 5.6.

>    s/([^\x20-\x7F])/'&#' . ord($1) . ';'/gse;

Sorry, I keep forgetting to update the FAQ. You probably really want:

  s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;

Otherwise it does all your CR, LF and Tab characters too.
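
A minimal sketch of that difference, for anyone who wants to see it. The helper name to_ncr and the sample string are made up for illustration and are not from the FAQ:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every character matching $class with a
# decimal numeric character reference.
sub to_ncr {
    my ($text, $class) = @_;
    $text =~ s/($class)/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $sample = "a\tb\n";

# Negated printable range: tab (9) and newline (10) fall outside
# \x20-\x7F, so they are converted as well.
my $printable_only = to_ncr($sample, qr/[^\x20-\x7F]/);

# Negated full ASCII range: control characters are left alone.
my $ascii_only = to_ncr($sample, qr/[^\x00-\x7F]/);

print "$printable_only\n";   # a&#9;b&#10;
print $ascii_only;           # the sample string, unchanged
```

With the \x20 lower bound every line ending in the document would turn into &#10;, which is rarely what is wanted.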

>    Produces:
> 
>    In XML input:     Output after regex:
>    &#x2122;      =>  &#8482;
>    &#x00E9;      =>  &#233;
> 
>    But leaving out 'use utf8'; and still using the second regex:
> 
>    In XML input:     Output after regex:
>    &#x2122;      =>  &#226;&#132;&#162;
>    &#x00E9;      =>  &#195;&#169;

Here's a short test script that demonstrates the regex working in 5.8
without the 'use utf8' line:

#!/usr/bin/perl

require 5.008;
use strict;
use warnings;

my $string = "TM: \x{2122}";

$string =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;

print $string, "\n";

which outputs:

TM: &#8482;
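
The three-entity output quoted earlier (&#226;&#132;&#162;) is what you would expect if the regex is walking over UTF-8 bytes rather than characters: 226, 132, 162 are the bytes E2 84 A2 that encode U+2122. The sketch below reproduces both results on 5.8 using Encode. The helper name non_ascii_to_ncr is made up for illustration, and this is only an assumption about what was happening in the 5.6 run, not a confirmed diagnosis:

```perl
#!/usr/bin/perl
require 5.008;
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper wrapping the regex from this thread.
sub non_ascii_to_ncr {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;
    return $text;
}

my $chars = "\x{2122} \x{E9}";          # two characters: trademark, e-acute
my $bytes = encode('UTF-8', $chars);    # the same text as raw UTF-8 bytes

print non_ascii_to_ncr($chars), "\n";   # &#8482; &#233;
print non_ascii_to_ncr($bytes), "\n";   # &#226;&#132;&#162; &#195;&#169;

# Decoding the bytes first restores the per-character result.
print non_ascii_to_ncr(decode('UTF-8', $bytes)), "\n";   # &#8482; &#233;
```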

Cheers
Grant


Re: Can I prevent XML::DOM::Parser from resolving character entities?
2007-07-13 05:37:42
Grant McLean writes:
> Here's a short test script that demonstrates the regex working in 5.8
> without the 'use utf8' line:
> 
> $string =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;

If you've got 5.8, then using Encode produces code that's much easier
to read (and also executes rather more quickly):

  use Encode qw<encode FB_HTMLCREF>;

  my $ascii_only = encode('US-ASCII', $string, FB_HTMLCREF);

You can use FB_XMLCREF to get hex rather than decimal character
references.

People still using 5.6 should be able to use Encode::compat to make
Encode work.
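
For comparison, here is a small sketch of the two fallbacks side by side on 5.8. The wrapper sub ascii_with_refs and the sample string are invented for the example. Only encode, FB_HTMLCREF and FB_XMLCREF come from Encode itself:

```perl
#!/usr/bin/perl
require 5.008;
use strict;
use warnings;
use Encode qw(encode FB_HTMLCREF FB_XMLCREF);

# Hypothetical wrapper: encode to US-ASCII, letting the given fallback
# decide how characters without an ASCII mapping are written out.
sub ascii_with_refs {
    my ($text, $fallback) = @_;
    return encode('US-ASCII', $text, $fallback);
}

my $string = "TM: \x{2122}";

print ascii_with_refs($string, FB_HTMLCREF), "\n";   # TM: &#8482;
print ascii_with_refs($string, FB_XMLCREF),  "\n";   # TM: &#x2122;
```

Both forms are valid numeric character references in XML, so the choice is mostly a matter of which is easier to read against a code chart.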

-- 
Aaron Crane
