Thread: set the encoding to the parser




set the encoding to the parser
Spain
2007-03-21 10:08:57
I have to parse some XML files using the XML::LibXML parser. But the
files don't have the "encoding" header, and their encoding is not the
default UTF-8 (it's ISO-8859-1).
Is it possible to set the encoding for the parser before parsing the
file?
What can I do? (I don't want to change the files.)
Thanks
_______________________________________________
Perl-XML mailing list
Perl-XML@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

Re: set the encoding to the parser
2007-03-21 11:30:47
Arantxa Otegi writes:
> I have to parse some XML files using the XML::LibXML parser. But the
> files don't have the "encoding" header, and their encoding is not the
> default UTF-8 (it's ISO-8859-1).  Is it possible to set the encoding
> for the parser before parsing the file?  What can I do? (I don't want
> to change the files.)

If the files are well-formed XML, and they use ISO 8859-1 as their
character encoding, then they must declare that encoding at the top.
XML::LibXML Just Works when you feed it such a file:

  $ hexdump -C latin1.xml
  00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
  00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 49 53  |.0" encoding="IS|
  00000020  4f 2d 38 38 35 39 2d 31  22 3f 3e 0a 3c 61 3e f7  |O-8859-1"?>.<a>.|
  00000030  3c 2f 61 3e 0a                                    |</a>.|
  00000035
  $ perl -MXML::LibXML -we 'binmode STDOUT, ":utf8"; print XML::LibXML->new->parse_file("latin1.xml")->textContent' | hexdump -C
  00000000  c3 b7                                             |..|
  00000002

However, given that you're asking the question, I suspect that the files
you've got are malformed, in that they don't declare their non-default
encoding.  And, yes, XML::LibXML correctly refuses to handle such
malformed inputs.

If you're sure that the files you've got are _all_ broken in this way,
a reasonable approach would be to transcode them to UTF-8 before
attempting to feed them to XML::LibXML.  The simplest way of doing that
is probably to open the file using the encoding you "know" the file has:

  $ tail -n 1 latin1.xml > malformed.xml
  $ perl -MXML::LibXML -we 'binmode STDOUT, ":utf8"; open my $fh, "<:encoding(latin1)", "malformed.xml" or die $!; print XML::LibXML->new->parse_fh($fh, "malformed.xml")->textContent' | hexdump -C
  00000000  c3 b7                                             |..|
  00000002

If only some of the files are malformed in this way, you can first
attempt to parse them directly, and fall back to enforcing Latin-1 only
if that fails.  That's theoretically unreliable, because some valid
Latin-1 sequences are also valid UTF-8, but statistically it's wildly
improbable that it would cause a problem for textual data.  And the data
loss has already happened, when the broken files were created, so you
can't actually fix it properly anyway.
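
The try-then-fall-back idea could be sketched roughly as follows. This is
only an illustration: the helper name parse_maybe_latin1 is made up, and
instead of an encoding layer on the filehandle it transcodes the raw bytes
with the core Encode module before handing them to parse_string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use XML::LibXML;

# Hypothetical helper: parse $path as-is, and only if that fails
# re-read the bytes as ISO-8859-1 and give the parser UTF-8 instead.
sub parse_maybe_latin1 {
    my ($path) = @_;
    my $parser = XML::LibXML->new;

    # First attempt: normal XML rules (declared encoding, or UTF-8
    # when nothing is declared). Parse errors are thrown as exceptions.
    my $doc = eval { $parser->parse_file($path) };
    return $doc if $doc;

    # Fallback: slurp the raw bytes and transcode Latin-1 to UTF-8.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;

    my $utf8 = encode('UTF-8', decode('ISO-8859-1', $bytes));

    # Dies with the parser's own error if the file is broken in
    # some other way than a missing encoding declaration.
    return $parser->parse_string($utf8);
}
```

The fallback assumes the file carries no encoding declaration of its own;
a file that declares one encoding and fails for an unrelated reason will
simply fail again in the second attempt.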

-- 
Aaron Crane

Re: set the encoding to the parser
United States
2007-03-21 11:51:30
On Wed, Mar 21, 2007 at 04:08:57PM +0100, Arantxa Otegi wrote:
> I have to parse some XML files using the XML::LibXML parser. But the
> files don't have the "encoding" header, and their encoding is not the
> default UTF-8 (it's ISO-8859-1).
> Is it possible to set the encoding for the parser before parsing the
> file?
> What can I do? (I don't want to change the files.)

Can you prepend an XML declaration?

    <?xml version="VVV" encoding="ISO-8859-1"?>

E.g. read the file into memory but stick that at the start...
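
A minimal sketch of the read-into-memory approach, under some assumptions:
the helper name parse_with_declared_encoding is invented for this example,
the XML version is taken to be "1.0", and the file is expected to have no
XML declaration of its own (the helper refuses to continue if it finds
one, since two declarations would make the document malformed).

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse a file that lacks an XML declaration by
# prepending one in memory. The file on disk is left untouched.
sub parse_with_declared_encoding {
    my ($path, $encoding) = @_;

    # Read raw bytes; no decoding happens on the Perl side.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;

    # Prepending is only valid if there is no declaration already.
    die "$path already has an XML declaration\n"
        if $bytes =~ /\A<\?xml\s/;

    # Version "1.0" is an assumption made for this sketch.
    my $decl = qq{<?xml version="1.0" encoding="$encoding"?>\n};
    return XML::LibXML->new->parse_string($decl . $bytes);
}
```

It would be called as parse_with_declared_encoding("malformed.xml",
"ISO-8859-1"), and the parser then does the decoding itself.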

If not, as Aaron Crane suggested, convert using an encoding layer (via
open or binmode), or run iconv externally, or do

    $text =~ s/([\xA0-\xFF])/"&#" . ord($1) . ";"/eg;

to turn legal 8-bit ISO 8859-1 characters into numeric character
references (e.g. &#160;).
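
For illustration, the substitution can be wrapped in a small function that
leaves its argument untouched. The name latin1_to_char_refs is made up
here, and the input is assumed to be a byte string, that is, read without
any decoding layer.

```perl
use strict;
use warnings;

# Replace every byte in the range 0xA0-0xFF with a numeric character
# reference. This works because an ISO 8859-1 byte value equals the
# Unicode code point of the character it encodes.
sub latin1_to_char_refs {
    my ($bytes) = @_;
    (my $ascii = $bytes) =~ s/([\xA0-\xFF])/'&#' . ord($1) . ';'/ge;
    return $ascii;
}

print latin1_to_char_refs("<a>\xF7</a>"), "\n";   # prints <a>&#247;</a>
```

Bytes 0x80-0x9F are left alone, so the result is pure ASCII (and hence
valid UTF-8) only if the input contains none of them.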

It would be best to fix the files though.

Liam

-- 
Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin/
http://www.holoweb.net/~liam/ * http://www.fromoldbooks.org/

Re: set the encoding to the parser
Czech Republic
2007-03-21 17:01:40
On Wednesday 21 March 2007, Arantxa Otegi wrote:
> I have to parse some XML files using the XML::LibXML parser. But the
> files don't have the "encoding" header, and their encoding is not the
> default UTF-8 (it's ISO-8859-1).
> Is it possible to set the encoding for the parser before parsing the
> file?
> What can I do? (I don't want to change the files.)
> Thanks

Currently you can't do that. Possible solutions:

1) prepend <?xml version="1.0" encoding="iso-8859-1"?> to the XML string
you send to the parser. You may either read the whole document into
memory before parsing, or use the push parser (see XML::LibXML::Parser
for details).

2) re-encode the input into UTF-8 (using parse_fh on a FH with an
':encoding(iso-8859-1)' IO layer could do the job)
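
The push-parser variant of option 1 might look roughly like this. It is a
sketch only: the helper name and the 8192-byte chunk size are arbitrary
choices for this example, and the file is assumed to contain no XML
declaration of its own.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: stream a declaration-less file into the push
# parser, sending a synthetic XML declaration as the first chunk.
sub push_parse_with_declaration {
    my ($path, $encoding) = @_;
    my $parser = XML::LibXML->new;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    $parser->init_push;
    $parser->push(qq{<?xml version="1.0" encoding="$encoding"?>\n});

    # Feed the raw bytes in fixed-size chunks, so the whole file never
    # has to be held in one Perl string.
    while (read($fh, my $chunk, 8192)) {
        $parser->push($chunk);
    }
    close $fh;

    # Returns the finished document, or dies on a parse error.
    return $parser->finish_push;
}
```

Compared with reading the whole file into one string, this keeps the
raw input out of a single Perl scalar, although the resulting DOM is of
course still built in memory.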

The new libxml2 API allows one to force encoding on the parser, but
XML::LibXML still uses the old one. In some future version, hopefully...

-- Petr