I've spent quite a bit of time working with both Neko and Tagsoup, and they both have some fairly serious bugs:
Neko has some occasional hangs, and it doesn't deal very well with a fair amount of "bad" HTML that displays just fine in a browser.
Tagsoup is better in terms of handling "bad" HTML, but it has a pretty serious bug in that HTML character entities are expanded in inappropriate places, e.g. inside of hrefs, so that a dynamic URL of the form http://www.foo.com/bar?x=1&sub=5 has problems: the &sub is interpreted as an HTML character entity, and an invalid href is created. John Cowan, the author of Tagsoup, more or less said "yeah, I know, everybody mentions that, but that's done at such a low level in the code it's not likely to get fixed any time soon". (See a discussion of this and other issues at http://tech.groups.yahoo.com/group/tagsoup-friends/message/838).
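
To make the href problem concrete, here is a minimal Java sketch of the kind of lenient entity expansion that would produce this symptom. It is not Tagsoup's actual code: the class name EntityDemo, the five-entry entity table, and both regexes are hypothetical and exist only for illustration. The lenient pattern expands "&name" even when the trailing semicolon is missing, while the strict pattern insists on "&name;".

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDemo {
    // Tiny illustrative entity table (an assumption); a real parser carries the full HTML set.
    static final Map<String, String> ENTITIES = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "sub", "\u2282");

    // Lenient: matches "&name" with or without a trailing semicolon.
    static final Pattern LENIENT = Pattern.compile("&([A-Za-z]+);?");
    // Strict: matches only "&name;".
    static final Pattern STRICT = Pattern.compile("&([A-Za-z]+);");

    // Replace every match whose name is in the table; leave unknown names untouched.
    static String decode(String s, Pattern p) {
        Matcher m = p.matcher(s);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String repl = ENTITIES.get(m.group(1));
            m.appendReplacement(out,
                Matcher.quoteReplacement(repl != null ? repl : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String href = "http://www.foo.com/bar?x=1&sub=5";
        // Lenient decoding turns "&sub" into U+2282 and corrupts the URL.
        System.out.println(decode(href, LENIENT));
        // Strict decoding leaves the unterminated "&sub" alone.
        System.out.println(decode(href, STRICT));
    }
}
```

With the lenient pattern the query string loses its "&sub" parameter separator; with the strict one the URL passes through unchanged, and a properly escaped "&amp;sub=5" still decodes to "&sub=5".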
The Tagsoup bug affects some 3-4% of the sites in my index, so I consider it fatal, and I *know* Neko misses some text, sometimes entire documents, because it can't deal with pathological HTML.
Has anyone (a) got local fixes for any of these problems, or (b) found a superior Java HTML parser out there?
Doug
--
View this message in context: http://www.nabble.com/Anyone-looked-for-a-better-HTML-parser--tf4630266.html#a13221500
Sent from the Nutch - Dev mailing list archive at Nabble.com.