Hi Daniel / all
I encountered some problems with the HTML chunk parser when certain
HTML keywords overlap the end of a chunk (when calling
htmlParseChunk()). It seems that the HTML parser does not recognize
the keyword in certain cases and loses the context. To describe the
problem more clearly, I created a simple test that can be reproduced
using "testHTML.c" from libxml2. These findings are based on
libxml2-2.6.24, and I did not find this issue already documented.
Description:
------------
If the function htmlParseChunk() is called with a chunk of bytes where
a closing </script> or </style> tag overlaps the end of the chunk, the
HTML parser will fail to recognize the closing tag and will interpret
the second part of the closing tag as CDATA. This gives unpredictable
results with SAX callbacks for the rest of the HTML content.
Example:
--------
Call the function htmlParseChunk() twice in a row with the following
two buffers (buffer bytes between the quotes):
Buffer1:
"<html><body><script></"
Buffer2:
"script> <a href='test'>LINK</a></body></html>"
The two buffers concatenated are valid HTML with an empty script
block. There is no special character between the two buffers. An
application using the SAX callbacks will be called like this:
- startElement("html")
- startElement("body")
- startElement("script")
- cdata("</")          <==== ouch! we expect endElement("script")
- cdata("script> <a href='test'>...")
- ...
The HTML parser needs another closing </script> tag to get back into
the game.
Test with testHTML.c:
---------------------
The easiest way for anybody to test this behaviour is the following:
Reduce the chunk size variable "size" in testHTML.c from 4096 to 10 on
lines 641 and 671. This makes sure that testHTML uses small chunks so
we can process a small test file.
Use the following HTML content as a test HTML file that we call
chunktest.html (without the dashes):
------------------------------
<html><body>.......
..........<script></script>
<a href="test">LINK</a>
<script></script>
</body>
</html>
------------------------------
Note that testHTML first consumes 4 bytes and then 10 at a time (after
the change from above). The first line contains 23 characters plus the
newline and therefore the closing </script> tag will overlap the next
chunk border.
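
The boundary arithmetic can be checked with a small helper. This is a
sketch only: split_point() is a hypothetical name, and the offsets
used below are derived from the numbers stated above (a 24-byte first
line including the newline, followed by 10 dots and the 8-byte
<script> tag, which puts the closing tag at byte offset 42).

```c
#include <stddef.h>

/*
 * Chunk boundaries lie at first, first + size, first + 2*size, ...
 * For a token of `len` bytes starting at byte offset `start`, return how
 * many of its bytes land before the first boundary that cuts through it,
 * or 0 if no boundary falls strictly inside the token.
 */
size_t split_point(size_t start, size_t len, size_t first, size_t size)
{
    size_t boundary = first;

    if (size == 0)
        return 0;  /* degenerate chunk size: nothing to compute */

    /* Skip boundaries at or before the token start; a boundary exactly
     * at `start` does not split the token. */
    while (boundary <= start)
        boundary += size;

    return (boundary < start + len) ? boundary - start : 0;
}
```

With those assumed offsets, split_point(42, 9, 4, 10) returns 2: the
boundary at byte 44 leaves "</" in one chunk and "script>" in the
next, which is the same split as in the two-buffer example.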
Using the following command with this test file, I get the following
output, which shows how the closing </script> tag is interpreted as
CDATA content:
# ./testHTML --push --sax --debug chunktest.html
SAX.setDocumentLocator()
SAX.startDocument()
SAX.startElement(html)
SAX.startElement(body)
SAX.characters(.......
.........., 18)
SAX.startElement(script)
SAX.error: Invalid char in CDATA 0x0
SAX.cdata(</, 2)
SAX.error: htmlParseEndTag: '</' not found
SAX.cdata(cript>
<a href="test", 26)
SAX.error: Unexpected end tag : a
SAX.cdata(
<script>, 9)
SAX.endElement(script)
SAX.characters(
, 1)
SAX.endElement(body)
SAX.ignorableWhitespace(
, 1)
SAX.endElement(html)
SAX.ignorableWhitespace(
, 1)
SAX.endDocument()
I assume that this is a bug and that the HTML parser should be able to
handle HTML tags that overlap the chunk boundary. If I'm wrong on that
assumption, then of course the caller would have to make sure that no
tags are overlapping. This, however, would require parsing the HTML
before calling htmlParseChunk().... erm.. boom
Best regards
Cyrill
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml gnome.org
http://mail.gnome.org/mailman/listinfo/xml