Arantxa Otegi writes:
> I have to parser some XML files using XML::LibXML parser. But, the
> files don't have the "encoding" header, and they're encoding is not
> the default UTF-8 (it's ISO-8859-1). Is possible to set the encoding
> to the parser before parsing the file? What can I do? (I don't want
> to change the files)

If the files are well-formed XML, and they use ISO 8859-1 as their
character encoding, then they must declare that encoding at the top.
XML::LibXML Just Works when you feed it such a file:

    $ hexdump -C latin1.xml
    00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
    00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 49 53  |.0" encoding="IS|
    00000020  4f 2d 38 38 35 39 2d 31  22 3f 3e 0a 3c 61 3e f7  |O-8859-1"?>.<a>.|
    00000030  3c 2f 61 3e 0a                                    |</a>.|
    00000035
    $ perl -MXML::LibXML -we 'binmode STDOUT, ":utf8"; print XML::LibXML->new->parse_file("latin1.xml")->textContent' | hexdump -C
    00000000  c3 b7                                             |..|
    00000002

However, given that you're asking the question, I suspect that the files
you've got are malformed, in that they don't declare their non-default
encoding. And, yes, XML::LibXML correctly refuses to handle such
malformed inputs.

If you're sure that the files you've got are _all_ broken in this way,
a reasonable approach would be to transcode them to UTF-8 before
attempting to feed them to XML::LibXML. The simplest way of doing that
is probably to open the file using the encoding you "know" the file has:

    $ tail -n 1 latin1.xml > malformed.xml
    $ perl -MXML::LibXML -we 'binmode STDOUT, ":utf8"; open my $fh, "<:encoding(latin1)", "malformed.xml" or die $!; print XML::LibXML->new->parse_fh($fh, "malformed.xml")->textContent' | hexdump -C
    00000000  c3 b7                                             |..|
    00000002

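If you'd rather have that in a script than a one-liner, here is a minimal sketch of the same transcoding idea. It is a variation on the one-liner rather than a copy of it: it reads the raw bytes and converts them with the core Encode module instead of using an :encoding layer on the filehandle. The helper name parse_undeclared_latin1 is made up for illustration, and the sketch assumes the file really is Latin-1 and has no encoding declaration of its own (a file that declares encoding="ISO-8859-1" must not be transcoded like this, since the declaration would then be wrong).

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use XML::LibXML;

# Hypothetical helper (not part of XML::LibXML): parse a file that is known
# to be ISO 8859-1 but carries no encoding declaration.
sub parse_undeclared_latin1 {
    my ($path) = @_;

    # Read the raw bytes, with no PerlIO decoding layer.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;

    # Transcode Latin-1 bytes to UTF-8 bytes. With no declaration in the
    # document, the parser assumes UTF-8, which is now correct.
    my $utf8 = encode('UTF-8', decode('ISO-8859-1', $bytes));

    return XML::LibXML->load_xml(string => $utf8);
}
```

With the malformed.xml from above, parse_undeclared_latin1("malformed.xml")->documentElement->textContent should give back the single character U+00F7.
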
If only some of the files are malformed in this way, you can first
attempt to parse them directly, and fall back to enforcing Latin-1 only
if that fails. That's theoretically unreliable, because some valid
Latin-1 sequences are also valid UTF-8 (so a Latin-1 file could parse
without error and silently yield the wrong characters), but statistically
it's wildly improbable that it would cause a problem for textual data.
And the data loss has already happened: the information about which
encoding was used was dropped when the broken files were created, so you
can't actually fix it properly anyway.

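A rough sketch of that try-then-fall-back approach follows. The function name parse_with_latin1_fallback is hypothetical, and the sketch assumes that any file which fails to parse directly is an undeclared Latin-1 file. A file that is broken for some other reason will simply fail a second time and the error will propagate to the caller.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use XML::LibXML;

# Hypothetical helper: try a normal parse first, and only if that throws,
# retry on the assumption that the bytes are undeclared ISO 8859-1.
sub parse_with_latin1_fallback {
    my ($path) = @_;

    # First attempt: let the parser use the declared encoding, or its
    # UTF-8 default. XML::LibXML dies on parse errors, so wrap it in eval.
    my $doc = eval { XML::LibXML->load_xml(location => $path) };
    return $doc if defined $doc;

    # Fallback: read the raw bytes and transcode Latin-1 to UTF-8.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    my $utf8 = encode('UTF-8', decode('ISO-8859-1', $bytes));

    # If this parse fails too, the exception is not caught here.
    return XML::LibXML->load_xml(string => $utf8);
}
```

Trying the direct parse first means correctly declared files, and files that really are UTF-8, are never touched by the Latin-1 guess.
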
--
Aaron Crane
_______________________________________________
Perl-XML mailing list