
Thread: HTML Parser problems with chunk parser if HTML keywords overlap chunk border




HTML Parser problems with chunk parser if HTML keywords overlap chunk border
user name
2006-06-22 08:57:02
I suppose I found the reason why chunked CDATA parsing also fails
without the special recovery mode:

If the chunk actually ends with "</", then htmlParseTryOrFinish() calls
htmlParseScript() to process it. In there, the normal break condition is
coded as follows:

if ((cur == '<') && (NXT(1) == '/')) {
    if (((NXT(2) >= 'A') && (NXT(2) <= 'Z')) ||
        ((NXT(2) >= 'a') && (NXT(2) <= 'z')))
    {
        break; /* while */
    }
}

However, NXT(2) is not guaranteed to be available. So it will not break
but consume the "</", which leads to a broken CDATA parsing in all
cases, even without PARSE_HTML_RECOVER being set. This could be solved
by avoiding calling htmlParseScript() with a chunk ending with "</".
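
To make the failure mode above concrete, here is a minimal standalone sketch.
It is not libxml2 code: peek() is a hypothetical stand-in for the NXT() macro,
and it assumes that a byte which has not arrived yet reads as 0. The loop uses
the same break condition as the snippet above and reports how many bytes of a
chunk would be consumed as script content.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for NXT(n): the byte at offset n from pos, or 0
 * when that byte is not in the current chunk (an assumption of this sketch). */
static char peek(const char *buf, size_t len, size_t pos, size_t n)
{
    return (pos + n < len) ? buf[pos + n] : 0;
}

static int is_ascii_letter(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Scans one chunk with the break condition quoted above and returns the
 * number of bytes consumed as CDATA before the loop stops. */
static size_t scan_script_chunk(const char *buf, size_t len)
{
    size_t pos = 0;

    assert(buf != NULL);
    while (pos < len) {
        /* Stop only when "</" is followed by a letter that is visible. */
        if (buf[pos] == '<' &&
            peek(buf, len, pos, 1) == '/' &&
            is_ascii_letter(peek(buf, len, pos, 2)))
            break;
        pos++;
    }
    return pos;
}
```

With the complete input "var x;</script>" (15 bytes) the scan stops at
offset 6, right before the end tag. With a chunk cut off after "var x;</"
(8 bytes) it returns 8, i.e. the "</" has been swallowed as script content.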

The case with the CDATA recovery option is even more complicated.

I wonder what you think if we would check in htmlParseTryOrFinish() that
the last 8 characters of the chunk do not include "</" before calling
htmlParseScript() in order to solve both cases? Assuming we are in a
CDATA block being followed by at least one real end tag and other tags
afterwards this should be safe, shouldn't it?

Cyrill

PS: Please let me know if such detailed source code discussions are not
supposed to be done on the list
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xmlgnome.org
http://mail.gnome.org/mailman/listinfo/xml
HTML Parser problems with chunk parser if HTML keywords overlap chunk border
user name
2006-06-22 09:24:19
On Thu, Jun 22, 2006 at 10:57:02AM +0200, Cyrill Osterwalder wrote:
> 
> I suppose I found the reason why chunked CDATA parsing also fails
> without the special recovery mode:
> 
> If the chunk actually ends with "</", then htmlParseTryOrFinish() calls
> htmlParseScript() to process it. In there, the normal break condition is
> coded as follows:
> 
> if ((cur == '<') && (NXT(1) == '/')) {
>     if (((NXT(2) >= 'A') && (NXT(2) <= 'Z')) ||
>         ((NXT(2) >= 'a') && (NXT(2) <= 'z')))
>     {
>         break; /* while */
>     }
> }
> 
> However, NXT(2) is not guaranteed to be available. So it will not break
> but consume the "</", which leads to a broken CDATA parsing in all
> cases, even without PARSE_HTML_RECOVER being set. This could be solved
> by avoiding calling htmlParseScript() with a chunk ending with "</".
> 
> The case with the CDATA recovery option is even more complicated.
> 
> I wonder what you think if we would check in htmlParseTryOrFinish() that
> the last 8 characters of the chunk do not include "</" before calling
> htmlParseScript() in order to solve both cases? Assuming we are in a
> CDATA block being followed by at least one real end tag and other tags
> afterwards this should be safe, shouldn't it?

  I think delaying calling the parser if "</" is present in the last 8
characters would be somewhat broken. You could perfectly well find a
number of other elements after the script/style block (actually I would
expect that) and those need to be closed.
  What should be checked is probably that there are at least 8 characters
in the buffer for consumption there (i.e. avail >= 8), that should be safe:
   - it guarantees we can test for the tag name
   - a style or script is unlikely to be at the very end of an HTML
     document (and if it is, we would have terminate set), plus it's not
     yet displayable content, so waiting for the next packet should not
     generate a degradation there.

  Can you test by changing the condition to:

                    if ((!terminate) &&
                        ((htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0) ||
                         (avail < 8)))
                        goto done;

in that "Handle SCRIPT/STYLE separately" section and report? If the result
is positive, please provide a contextual patch.
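
The suggested condition can be tried out in isolation with a small sketch
like the one below. It is not the libxml2 source: find_end_tag_start() is a
hypothetical stand-in for htmlParseLookupSequence(ctxt, '<', '/', 0, 0) and
MIN_LOOKAHEAD is an assumed name for the constant 8. The function only
answers the question "should parsing of the script/style content wait for
more data?", which corresponds to the goto done above.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_LOOKAHEAD 8 /* assumed name for the "8" in the condition above */

/* Hypothetical stand-in for the sequence lookup: index of the first "</"
 * in the available bytes, or -1 when there is none. */
static long find_end_tag_start(const char *buf, size_t avail)
{
    size_t i;

    for (i = 0; i + 1 < avail; i++) {
        if (buf[i] == '<' && buf[i + 1] == '/')
            return (long) i;
    }
    return -1;
}

/* Returns 1 when the caller should wait for the next chunk, 0 when it may
 * go ahead and parse the script/style content now. */
static int should_defer_script_parse(const char *buf, size_t avail,
                                     int terminate)
{
    assert(buf != NULL);
    if (terminate)
        return 0; /* last chunk: nothing more will arrive */
    return (find_end_tag_start(buf, avail) < 0) || (avail < MIN_LOOKAHEAD);
}
```

For a short chunk such as "x</" (3 bytes, terminate unset) this returns 1,
so the parser would wait. Once "var x;</script>" (15 bytes) is available,
or when terminate is set, it returns 0.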

> PS: Please let me know if such detailed source code discussions are not
> supposed to be done on the list

  that's fine, that's where the knowledge should be shared!

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillardredhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/