Thread: HTML Parser problems with chunk parser if HTML keywords overlap chunk border

HTML Parser problems with chunk parser if HTML keywords overlap chunk border
user name
2006-06-21 14:29:56
Hi all

After some more research I believe I have found the reason for the
problem with the CDATA parsing. When PARSE_HTML_RECOVER is set, the
following condition in htmlParseTryOrFinish() is not sufficient for
calling htmlParseScript():

/*
 * Handle SCRIPT/STYLE separately
 */
if ((!terminate) &&
    (htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0))
        goto done;
htmlParseScript(ctxt);


This code makes sure that an end tag starts somewhere in the buffer
that htmlParseScript() is going to process. However, in recovery mode
htmlParseScript() will consume the "</" characters if the real CDATA
end tag is not fully inside the current chunk (as described in the
problem report).
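
To illustrate the gap described above, here is a minimal standalone sketch. It is not libxml2 code: lookup_lt_slash() and name_fits_in_chunk() are hypothetical helpers that only mimic the idea of the check, assuming the buffer is a plain byte array that ends at the chunk border.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the "</" lookup quoted above: returns the
 * offset of the first "</" in buf, or -1 if there is none. */
static int lookup_lt_slash(const char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == '<' && buf[i + 1] == '/')
            return (int)i;
    }
    return -1;
}

/* Returns 1 if enough bytes follow the "</" at offset pos to compare them
 * against the whole element name, 0 if the chunk ends inside the name. */
static int name_fits_in_chunk(size_t len, int pos, const char *name)
{
    return (size_t)pos + 2 + strlen(name) <= len;
}
```

For a chunk that ends in "</scr", lookup_lt_slash() succeeds, so the existing check would let the script parser run, while name_fits_in_chunk() shows that a comparison against "script" cannot be decided yet.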

I don't have a patch recommendation at the moment, but I see two
possibilities:

a) htmlParseTryOrFinish() could guarantee that the buffer contains the
desired close tag (or that terminate is true). I guess this could be
done using multiple htmlParseLookupSequence() calls and checking for
the tag name in a loop...?
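
One way to picture possibility a) is the following standalone sketch. It does not use the libxml2 API: find_close_tag(), name_matches() and the result codes are hypothetical names, and it assumes that a "</" not followed by the element name counts as ordinary script text. It also does not check what follows the name (for example '>' or whitespace), which a real patch would need to do.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Result codes are hypothetical, for illustration only. */
enum { END_TAG_MISSING = -1, END_TAG_INCOMPLETE = -2 };

/* Case-insensitive compare of the first n bytes of buf and name. */
static int name_matches(const char *buf, const char *name, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)buf[i]) != tolower((unsigned char)name[i]))
            return 0;
    }
    return 1;
}

/*
 * Scan buf for "</name". Returns the offset of the '<' when the whole
 * "</name" is inside the buffer, END_TAG_INCOMPLETE when data at the end
 * of the buffer could still become the close tag once more input arrives,
 * and END_TAG_MISSING otherwise.
 */
static int find_close_tag(const char *buf, size_t len, const char *name)
{
    size_t nlen = strlen(name);

    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] != '<' || buf[i + 1] != '/')
            continue;
        size_t avail = len - (i + 2);   /* bytes available after "</" */
        if (avail < nlen) {
            /* Name is cut off: wait only if what we have is a prefix. */
            if (name_matches(buf + i + 2, name, avail))
                return END_TAG_INCOMPLETE;
            continue;
        }
        if (name_matches(buf + i + 2, name, nlen))
            return (int)i;
    }
    /* A lone '<' as the last byte might be the start of "</". */
    if (len > 0 && buf[len - 1] == '<')
        return END_TAG_INCOMPLETE;
    return END_TAG_MISSING;
}
```

In this sketch the caller would only hand the buffer to the script parser when the result is a non-negative offset or when terminate is set, and would otherwise wait for the next chunk.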

b) htmlParseScript() would have to become smart enough to recognize
that it is trying to do xmlStrncasecmp() on an incomplete tag string.
In that case it should break and be called again by
htmlParseTryOrFinish(). That function, in turn, would have to be more
careful with the switch to end tag processing after the call to
htmlParseScript().
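
Possibility b) could look roughly like the sketch below. Again this is a standalone illustration, not the real htmlParseScript(): scan_script() and the scan_status codes are hypothetical, and the sketch assumes the caller keeps the unconsumed tail of the buffer and calls again after appending the next chunk.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes, for illustration only. */
typedef enum { SCAN_END_TAG, SCAN_NEED_MORE } scan_status;

/*
 * Consume script text from buf. *consumed receives the number of bytes
 * that are safe to report as character data. Scanning stops either at a
 * complete "</name" (SCAN_END_TAG) or at the end of the data, holding
 * back a trailing "<", "</" or partial "</nam" that cannot be classified
 * yet (SCAN_NEED_MORE).
 */
static scan_status scan_script(const char *buf, size_t len, const char *name,
                               size_t *consumed)
{
    size_t nlen = strlen(name);
    size_t i = 0;

    while (i < len) {
        if (buf[i] != '<') { i++; continue; }
        if (i + 1 == len) break;                /* lone '<' at the border */
        if (buf[i + 1] != '/') { i++; continue; }

        size_t avail = len - (i + 2);           /* bytes after "</" */
        size_t cmp = avail < nlen ? avail : nlen;
        size_t k = 0;
        while (k < cmp &&
               tolower((unsigned char)buf[i + 2 + k]) ==
               tolower((unsigned char)name[k]))
            k++;
        if (k == nlen) { *consumed = i; return SCAN_END_TAG; }
        if (k == avail) break;                  /* only a prefix: hold back */
        i += 2;                                 /* this "</" is plain text */
    }
    *consumed = i;
    return SCAN_NEED_MORE;
}
```

On SCAN_NEED_MORE the caller must not switch to end tag processing. It would retry with more data, or flush the held-back bytes as text once terminate is set.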

Possibility a) looks better to me and I might try to implement an
example patch.

Cyrill
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
HTML Parser problems with chunk parser if HTML keywords overlap chunk border
user name
2006-06-21 14:55:36
On Wed, Jun 21, 2006 at 04:29:56PM +0200, Cyrill Osterwalder wrote:
> Hi all
> 
> After some more research I believe to have found the reason for the
> problem with the CDATA parsing. In case PARSE_HTML_RECOVER is true, the
> following criteria in htmlParseTryOrFinish() is not enough for calling
> htmlParseScript():
> 
> /*
>  * Handle SCRIPT/STYLE separately
>  */
> if ((!terminate) &&
>     (htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0))
>         goto done;
> htmlParseScript(ctxt);
> 
> 
> This code makes sure that there is an end tag starting somewhere in the
> buffer that is going to be processed by htmlParseScript(). However, in
> recovery mode, htmlParseScript() will consume the "</" characters if the
> real CDATA end tag is not fully inside the current chunk (like described
> in the problem report).

  True. I was thinking about something like that. This is all due to
script and style having different parsing constraints.
  Why do you use PARSE_HTML_RECOVER? The parser already does recovery
to some extent without it (I mean the HTML parser).

> I don't have a patch recommendation for the moment but I see two
> possibilities:
> 
> a) htmlParseTryOrFinish() could guarantee that the buffer contains the
> desired close tag (or terminate is true). I guess that this could be
> done using multiple htmlParseLookupSequence() calls and checking for the
> tag name in a loop...?

  Hum, well, we could check for the current element and make 2 specific
tests in that case. This would be very hard anyway: people are going to
come with '</ style' or '</foo>' and expect that to close the open tag,
and 'style "</" style' and expect it not to close it...
  
> b) htmlParseScript would have to be more powerful in order to recognize
> that it is trying to do xmlStrncasecmp() on an incomplete tag string. In
> that case it should break and be called again by htmlParseTryOrFinish().
> That on the other hand would have to be more careful with the switch to
> the end tag processing after the call to htmlParseScript().

  Not sure it's much better

> Possibility a) looks better to me and might try to implement a patch
> example.

  You can try, but it's all very messy IMHO. I will take patches if they
are not obviously broken (it could be a good idea to provide examples for
the test suite too).

   thanks

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillardredhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/