
Thread: HTML Parser problems with chunk parser if HTML keywords overlap chunk border




user name
2006-06-22 10:46:29
>I think delaying calling the parser if "</" is present in
>the last 8 characters would be somewhat broken.
>You could perfectly find a number of
>other elements after the script/style block (actually I would
>expect that)
>and those need to be closed.

I see your point. However, I'm not sure that it wouldn't work. If we
wait until we have a chunk that does not have "</" in the trailing 8
characters and we call htmlParseScript() at that point, it should be
guaranteed that htmlParseScript() either reaches its breaking condition
or just consumes normal CDATA. If there are other elements after the
script/style block, they will be parsed correctly once htmlParseScript()
breaks, won't they?
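
A minimal Python sketch of this "wait until the tail is clean" rule, for
illustration only. It is not libxml2 code; the names WINDOW, should_delay
and feed_chunks are hypothetical, and the rule is applied to the whole
pending buffer rather than only inside a script/style block.

```python
WINDOW = 8  # trailing window from the thread; 8 == len("</script")


def should_delay(buffer: str) -> bool:
    """True if "</" occurs within the last WINDOW characters of buffer."""
    return "</" in buffer[-WINDOW:]


def feed_chunks(chunks):
    """Return the buffers that would be handed to the parser.

    Data is held back while the tail of the pending buffer contains "</",
    and whatever is still pending is flushed at end of input.
    """
    handed_over = []
    pending = ""
    for chunk in chunks:
        pending += chunk
        if not should_delay(pending):
            handed_over.append(pending)
            pending = ""
    if pending:
        handed_over.append(pending)  # end of input: nothing left to wait for
    return handed_over
```

Note that any chunk ending in an ordinary closing tag such as "</p>" is
also held back under this rule, so data can sit in the buffer until the
next chunk or the end of input.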

>What should be checked is probably that there are more than
>8 characters in the buffer for consumption there (i.e. avail >=8), that
>should be safe:
>  - it guarantees we can test for the tag name
>  - a style or script is unlikely to be at the very end of
>    an HTML document

What about a chunk that contains more than 8 CDATA characters (avail >=
8 would be true) but ends with "</" after the CDATA block?

Example of two chunks (without quotes):

Chunk1: "<html><body><script>var 12345678;</"
Chunk2: "script>normal-text</body></html>"

From my point of view, htmlParseScript() would fail to parse correctly
in this case even with the condition (avail >= 8).
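
The arithmetic behind this objection can be checked with a short Python
sketch. It only counts characters in the two example chunks; it does not
call libxml2, and END_TAG and can_decide are hypothetical names used for
illustration.

```python
CHUNK1 = "<html><body><script>var 12345678;</"
CHUNK2 = "script>normal-text</body></html>"
END_TAG = "</script"  # 8 characters, which is where the 8 in the thread comes from


def can_decide(buf: str, pos: int) -> bool:
    """True if enough characters follow buf[pos] to compare against END_TAG."""
    return len(buf) - pos >= len(END_TAG)


# Position right after the opening <script> tag in the first chunk.
start = CHUNK1.index("<script>") + len("<script>")

# Characters available to a script-content scanner when only CHUNK1 has arrived.
avail = len(CHUNK1) - start

# Position of the "</" that ends the first chunk.
close_pos = CHUNK1.index("</", start)

print("avail >= 8:", avail >= 8)
print("decidable with chunk 1 only:", can_decide(CHUNK1, close_pos))
print("decidable with both chunks:", can_decide(CHUNK1 + CHUNK2, close_pos))
```

With only the first chunk buffered, the avail check passes, yet just two
characters follow the position of "</", so the comparison against the
closing tag name cannot be made until the second chunk arrives.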

Your suggestion would probably work if we required at least 8
characters to parse in htmlParseScript() whenever "</" is encountered. This
would make sure that we can decide whether to break or not. But this
would have to hold for each "</" (in case of recovery). Requiring
more than 8 characters in this case should be a safe assumption.
The parser should stay in the CONTENT status and we would get
another chance when the next chunk comes in. Doing this just in my head
is a bit challenging... I might test it.
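
A rough Python sketch of that per-"</" lookahead rule, under stated
assumptions. It is not the libxml2 implementation: scan_script and its
return convention are hypothetical, the closing tag is fixed to
"</script", and holding back a lone trailing "<" is an extra precaution
that was not discussed above.

```python
def scan_script(buf: str, end_tag: str = "</script"):
    """Scan raw script text in buf.

    Returns (cdata, rest, done):
      cdata - text that is safely script content
      rest  - unconsumed input, to be retried once more data has arrived
      done  - True if the closing tag was found at the start of rest
    """
    pos = 0
    while True:
        i = buf.find("</", pos)
        if i < 0:
            # No "</" left: everything is CDATA, except a lone trailing "<"
            # that might be the first half of a "</" split across chunks.
            cut = len(buf) - 1 if buf.endswith("<") else len(buf)
            return buf[:cut], buf[cut:], False
        if len(buf) - i < len(end_tag):
            # Too few characters after "</" to decide: stop here and wait
            # for the next chunk (the caller stays in its content state).
            return buf[:i], buf[i:], False
        if buf[i:i + len(end_tag)].lower() == end_tag:
            return buf[:i], buf[i:], True
        # Some other "</" inside the script: treat it as CDATA and go on.
        pos = i + 2


# Usage with the two example chunks, starting after the opening <script> tag.
cdata1, rest1, done1 = scan_script("var 12345678;</")
cdata2, rest2, done2 = scan_script(rest1 + "script>normal-text</body></html>")
```

The first call consumes "var 12345678;" and leaves "</" pending; the
second call sees the complete closing tag and reports that the script
block has ended. A real parser would also need to flush the pending text
at end of input.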

Cyrill
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
