XML::LibXML - Newlines in attributes transformed to spaces
2007-03-02 07:46:23
I have an XML document containing one or more consecutive newlines in an
_attribute_ of one of the elements. The document is well-formed XML according
to xmllint. When I try to parse this document using XML::LibXML and extract
the content of the attribute using findvalue(), the newlines have been
transformed to spaces.

My application needs to be able to distinguish spaces and newlines in the
input documents (which are being generated by a third party).
Is this at all possible?

I have tried this on two different systems, one running libxml2 2.5.11 with
XML::LibXML 1.58, the other running libxml2 2.6.26 with XML::LibXML 1.61.
The results are the same on both systems.

Complete test case:

--- begin "test.xml" ---
<foo>
  <bar baz="this
is

a

test"/>
</foo>
--- end "test.xml" ---

--- begin "test.pl" ---
#!/usr/bin/perl

use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();
my $dom = $parser->parse_file('test.xml');
my $baz = $dom->findvalue('/foo/bar/@baz');

print "baz:\n", $baz;
--- end "test.pl" ---

$ ./test.pl > test.out
$ hexdump -C test.out
00000000  62 61 7a 3a 0a 74 68 69  73 20 69 73 20 20 20 20  |baz:.this is    |
00000010  61 20 20 74 65 73 74                              |a  test|
00000017

-- 
Lars Haugseth
_______________________________________________
Perl-XML mailing list
Perl-XML@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

Re: XML::LibXML - Newlines in attributes transformed to spaces
2007-03-02 08:59:37
Lars Haugseth writes:
> My application needs to be able to distinguish spaces and newlines in the
> input documents (which are being generated by a third party).
> Is this at all possible?

XML requires normalization of literal whitespace in attribute values;
the relevant bit of the spec is here:

  http://www.w3.org/TR/2006/REC-xml-20060816/#AVNormalize

Given that, any XML parser that (by default) gives different results for
these documents is broken:

  $ echo -e '<a b="c d"/>' | hexdump -C
  00000000  3c 61 20 62 3d 22 63 20  64 22 2f 3e 0a           |<a b="c d"/>.|
  0000000d
  $ echo -e '<a b="c\nd"/>' | hexdump -C
  00000000  3c 61 20 62 3d 22 63 0a  64 22 2f 3e 0a           |<a b="c.d"/>.|
  0000000d

I'm not sure which (if any) Perl XML parsers have non-XML modes which
disable this normalization, but I am sure that trying to use XML parsers
to handle nearly-XML documents is painful.  In this case, I'd try to
speak to the third party that generates these documents, and hopefully
get them to start producing documents that represent what they think
they do.

If you need XML documents which make this distinction, this is how to
write them:

  <a b="c d"/>
  <a b="c&#32;d"/>  <!-- same as previous -->
  <a b="c&#10;d"/>
  <a b="c&#xA;d"/>  <!-- same as previous -->
  <a b="c&#9;d"/>   <!-- literal tabs are also normalized to spaces -->
  <a b="c&#x9;d"/>  <!-- same as previous -->
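
To check the difference between a literal newline and a character
reference, something along these lines should do. It is only a minimal
sketch: attr_hex() is a hypothetical helper name, and the XPath /a/@b
matches the small sample documents shown here, not Lars's real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse an XML string and return the value of the
# "b" attribute on the root <a> element as space-separated hex bytes,
# so that spaces (20) and newlines (0a) can be told apart.
sub attr_hex {
    my ($xml) = @_;
    my $doc   = XML::LibXML->new()->parse_string($xml);
    my $value = $doc->findvalue('/a/@b');
    return join(' ', map { sprintf('%02x', ord($_)) } split(//, $value));
}

# Literal newline in the attribute value: subject to normalization.
print attr_hex(qq{<a b="c\nd"/>}), "\n";

# Newline written as a character reference: exempt from normalization.
print attr_hex('<a b="c&#10;d"/>'), "\n";
```

With a conforming parser the first line should show 20 between 63 and 64,
and the second should show 0a.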

-- 
Aaron Crane

Re: XML::LibXML - Newlines in attributes transformed to spaces
2007-03-02 09:10:46
Thanks to Aaron and mirod for pointing out that the flaw is in the input,
which by the way is generated by an application outside of our control.

Unfortunately it's a long-winded and expensive process to get that
application changed. The only option left for us at the moment seems to
be adding a filter on the XML data before parsing it, replacing newlines
within attributes with suitable character references.

It's a dirty hack, but sometimes that will have to do when you get
garbage in and want something else to come out.
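
A filter of that kind might look roughly like the following. This is a
sketch under loud assumptions, not a tested production filter: the
function names are made up for illustration, it uses regular expressions
rather than a real tokenizer, and it assumes the input contains no
comments, CDATA sections or processing instructions with quotes in them
(those would be rewritten as if they were tags).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace line breaks inside one quoted attribute
# value with the character reference &#10;.
sub _escape_value {
    my ($value) = @_;
    $value =~ s/\r\n|\r|\n/&#10;/g;
    return $value;
}

# Hypothetical helper: apply _escape_value to every quoted string in a tag.
sub _escape_tag {
    my ($tag) = @_;
    $tag =~ s/("[^"]*"|'[^']*')/_escape_value($1)/ge;
    return $tag;
}

# Rewrite literal newlines in attribute values only. Text between tags is
# left alone, because the pattern only matches from "<" to the closing ">".
sub escape_attr_newlines {
    my ($xml) = @_;
    $xml =~ s/(<[^<>"']*(?:(?:"[^"]*"|'[^']*')[^<>"']*)*>)/_escape_tag($1)/ge;
    return $xml;
}

# Demo on a document shaped like test.xml from the first message.
my $raw = qq{<foo>\n  <bar baz="this\nis\n\na\n\ntest"/>\n</foo>\n};
print escape_attr_newlines($raw);
```

The filtered string can then be handed to the parser's parse_string()
method instead of calling parse_file() on the raw file.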

-- 
Lars Haugseth