Thread: HTML Parser problems with chunk parser if HTML keywords overlap chunk border




HTML Parser problems with chunk parser if HTML keywords overlap chunk border
user name
2006-06-22 11:50:04
Hi Daniel

Do attachments of contextual patch files work with the list?


Anyway, I appended the contextual patch of my first fix attempt at the
end of this email. The first few tests here are now running
successfully, especially the known problem cases that I could reproduce
do not occur anymore. I'm going to test some more cases, involving
special situations around the closing CDATA tags. You mentioned the test
suite... how do people contribute and where?

The big question is now: Does everything else still work as expected?


I guess we could also use the htmlParseLookupSequence() with the
appropriate checkIndex being set instead of looking for the chars
manually. On the other hand that seems to be an overhead.

The patch is based on HTMLparser.c of libxml2-2.6.24.

Cyrill



*** HTMLparser.c.orig	Thu Mar  9 14:19:53 2006
--- HTMLparser.c	Thu Jun 22 13:34:11 2006
***************
*** 4936,4948 ****
  		cons = ctxt->nbChars;
  		if ((xmlStrEqual(ctxt->name, BAD_CAST"script")) ||
  		    (xmlStrEqual(ctxt->name, BAD_CAST"style"))) {
  		    /*
  		     * Handle SCRIPT/STYLE separately
  		     */
  		    if ((!terminate) &&
  		        (htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0))
  			goto done;
! 		    htmlParseScript(ctxt);
  		    if ((cur == '<') && (next == '/')) {
  			ctxt->instate = XML_PARSER_END_TAG;
  			ctxt->checkIndex = 0;
--- 4936,4976 ----
  		cons = ctxt->nbChars;
  		if ((xmlStrEqual(ctxt->name, BAD_CAST"script")) ||
  		    (xmlStrEqual(ctxt->name, BAD_CAST"style"))) {
+ 			int ntrailing, trailing_pos, i;
+ 
  		    /*
  		     * Handle SCRIPT/STYLE separately
  		     */
  		    if ((!terminate) &&
  		        (htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0))
  			goto done;
! 
! 			/* 
! 			 * First CDATA parsing fix attempt by Cyrill Osterwalder:
! 			 * 
! 			 * Guarantee that last 8 chars of this chunk 
! 			 * do not contain '</' if this is not 
! 			 * terminating round. We need this for htmlParseScript()
! 			 * to find the CDATA termination criteria in special cases
! 			 * where the end tag is overlapping the chunk boundary.
! 			 * Requiring this inside our script/style CDATA block should
! 			 * be safe, other elements will be parsed once we get back
! 			 * from htmlParseScript().
! 			 * */
! 			ntrailing = (avail > 8) ? 8 : avail;
! 			trailing_pos = avail - ntrailing;
! 			for (i = 0; i < ntrailing - 1; i++) {
! 				if (!terminate
! 						&& in->cur[trailing_pos + i] == '<'
! 						&& in->cur[trailing_pos + i + 1] == '/') {
! 					/* there is a '</' in the last 8 chars,
! 					 * we require more characters
! 					 * */
! 					goto done;
! 				}
! 			}
! 
! 			htmlParseScript(ctxt);
  		    if ((cur == '<') && (next == '/')) {
  			ctxt->instate = XML_PARSER_END_TAG;
  			ctxt->checkIndex = 0;
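
For readers who want to exercise the tail check from the patch outside the parser, here is a minimal standalone sketch. The function name tail_has_end_tag_start() and its parameters are hypothetical and not part of libxml2; it only mirrors the loop in the patch (a window of up to 8 trailing bytes, scanned for '<' followed by '/') and makes no claim about how the rest of htmlParseTryOrFinish() behaves.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of libxml2.
 * Returns 1 if the two-byte sequence "</" starts inside the last
 * `window` bytes of buf[0..avail), and 0 otherwise. This mirrors the
 * manual scan in the patch, where `window` is 8.
 */
static int
tail_has_end_tag_start(const char *buf, size_t avail, size_t window)
{
    /* Clamp the window to the data that is actually available. */
    size_t ntrailing = (avail > window) ? window : avail;
    size_t start = avail - ntrailing;
    size_t i;

    /* Stop one byte early: a pair needs both buf[i] and buf[i + 1]. */
    for (i = 0; i + 1 < ntrailing; i++) {
        if (buf[start + i] == '<' && buf[start + i + 1] == '/')
            return 1;
    }
    return 0;
}
```

In a push-parser loop, a caller following the patch's logic would wait for more input whenever this returns 1 and the current chunk is not the terminating one.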
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
HTML Parser problems with chunk parser if HTML keywords overlap chunk border
user name
2006-06-22 13:46:53
On Thu, Jun 22, 2006 at 01:50:04PM +0200, Cyrill Osterwalder wrote:
> Hi Daniel
> 
> Do attachments of contextual patch files work with the list?

  please use an attachment, not the mail body; mailers break
body content.

> Anyway, I appended the contextual patch of my first fix attempt at the
> end of this email. The first few tests here are now running
> successfully, especially the known problem cases that I could reproduce
> do not occur anymore. I'm going to test some more cases, involving
> special situations around the closing CDATA tags. You mentioned the test
> suite... how do people contribute and where?

  provide test examples as attachments too, I will plug them
into test/HTML

> The big question is now: Does everything else still work as expected?
> 
> 
> I guess we could also use the htmlParseLookupSequence() with the
> appropriate checkIndex being set instead of looking for the chars
> manually. On the other hand that seems to be an overhead.

  I would prefer the patch to use a second htmlParseLookupSequence(), yes,
that would be cleaner,

   thanks,

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
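
The idea of replacing the manual scan with a resumable lookup can be illustrated with a small standalone sketch. lookup_pair() below is a hypothetical, simplified analogue written for this note; it is not libxml2's htmlParseLookupSequence() and does not reproduce its signature or its handling of comments and quotes. It only shows the general technique of keeping a check index so that a search interrupted at a chunk border resumes where it stopped instead of rescanning.

```c
#include <stddef.h>

/*
 * Hypothetical, simplified analogue of a resumable two-byte lookup.
 * NOT libxml2's htmlParseLookupSequence().
 *
 * Scans buf[*check_index .. len) for `first` immediately followed by
 * `next`. Returns the offset of the match, or -1 if there is none yet.
 * On a miss, *check_index is set to the first offset that has not been
 * fully checked, so a later call with more data resumes from there.
 */
static long
lookup_pair(const char *buf, size_t len, char first, char next,
            size_t *check_index)
{
    size_t i = *check_index;

    for (; i + 1 < len; i++) {
        if (buf[i] == first && buf[i + 1] == next) {
            *check_index = 0;   /* reset for the next, unrelated lookup */
            return (long) i;
        }
    }

    /* The last byte may still turn out to start a pair: resume on it. */
    *check_index = i;
    return -1;
}
```

With the buffer "abc<" the call reports no match and leaves the index on the trailing '<'; once the next chunk extends the buffer to "abc</x", the same call finds the pair at offset 3 without revisiting the first three bytes.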