
Thread: Anyone looked for a better HTML parser?




Anyone looked for a better HTML parser?
United States
2007-10-15 15:44:53

I've spent quite a bit of time working with both Neko and Tagsoup, and
they both have some fairly serious bugs:

Neko has some occasional hangs, and it doesn't deal very well with a
fair amount of "bad" HTML that displays just fine in a browser.

Tagsoup is better in terms of handling "bad" HTML, but it has a pretty
serious bug in that HTML character entities are expanded in
inappropriate places, e.g. inside of hrefs, so that a dynamic URL of
the form http://www.foo.com/bar?x=1&sub=5 has problems: the &sub is
interpreted as an HTML character entity, and an invalid href is
created. John Cowan, the author of Tagsoup, more or less said "yeah, I
know, everybody mentions that, but that's done at such a low level in
the code it's not likely to get fixed any time soon". (See a discussion
of this and other issues at
http://tech.groups.yahoo.com/group/tagsoup-friends/message/838).
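
To make the failure mode concrete, here is a minimal sketch in Java. It is
not TagSoup's code: the class name EntityDemo, the four-entry entity table,
and both regular expressions are assumptions made up for illustration. It
only contrasts a decoder that expands "&name" without a closing semicolon
with one that requires "&name;".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDemo {
    // Tiny illustrative entity table (hypothetical); real parsers know many more names.
    static final Map<String, String> ENTITIES = new HashMap<String, String>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("sub", "\u2282"); // the "subset of" character
    }

    // Lenient: expands "&name" whether or not a ';' follows it.
    static final Pattern LENIENT = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);?");
    // Strict: expands only "&name;" with the terminating semicolon.
    static final Pattern STRICT = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String decode(String text, Pattern pattern) {
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            // Unknown entity names are copied through unchanged.
            String value = (replacement != null) ? replacement : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String href = "http://www.foo.com/bar?x=1&sub=5";
        // The lenient decoder replaces "&sub" and corrupts the query string.
        System.out.println(decode(href, LENIENT));
        // The strict decoder leaves the unterminated "&sub" alone.
        System.out.println(decode(href, STRICT));
        // A correctly escaped href decodes to the intended URL either way.
        System.out.println(decode("http://www.foo.com/bar?x=1&amp;sub=5", STRICT));
    }
}
```

With the lenient pattern, the "sub" parameter name is consumed and replaced
by a single non-ASCII character, which is the kind of invalid href described
above.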

The tagsoup bug affects some 3-4% of the sites in my index, so I
consider it fatal, and I *know* Neko misses some text, sometimes entire
documents, because it can't deal with pathological HTML.

Has anyone (a) got local fixes for any of these problems, or (b) found
a superior Java HTML parser out there?

Doug
-- 
View this message in context: http://www.nabble.com/Anyone-looked-for-a-better-HTML-parser--tf4630266.html#a13221500

Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: Anyone looked for a better HTML parser?
Finland
2007-10-16 08:50:41
Doug Cook wrote:
> The tagsoup bug affects some 3-4% of the sites in my index, so I
> consider it fatal, and I *know* Neko misses some text, sometimes
> entire documents, because it can't deal with pathological HTML.

Do you have urls of such bad content available to look at?

-- 
 Sami Siren

Re: Anyone looked for a better HTML parser?
United States
2007-10-16 09:54:51


Sami Siren-2 wrote:
> Do you have urls of such bad content available to look at?

Thousands. Here is one:

http://www.valtravieso.com/ver_finca.phtml?idioma=1

The hrefs that have &sub in them get interpreted as the subset
character by tagsoup, and thus become broken links. With a few sites
(and I think this is one) the number of URLs will grow ad infinitum if
the site handles the "broken link" by returning something that works
and uses the input link as a base.

I believe I have some examples of Neko problems around as well; I've
been gathering test cases...

 -Doug


Re: Anyone looked for a better HTML parser?
Poland
2007-10-17 07:12:53
I looked at the TagSoup sources and it seems it could be quite easily
fixed. See here:

https://issues.apache.org/jira/browse/NUTCH-567

D.

