Eugen Kochuev wrote:
> P.P.S Why not to develop efficient technique to fight near-duplicates
> and SE spam? This is absolutely necessary if build Internet search
>
Why not, indeed? ;) The answer is that it is very difficult. Nutch uses
some simple methods (an MD5 digest and a "text profile"), but generally
speaking it is a difficult task. If you consider that pages may contain
elements that change daily (such as the date) or even with every
request (ads, counters, banners, current time), or that depend on the
context (first request, subsequent requests), or that pages may be
composed from reusable parts (portlets), the problem doesn't seem so
trivial anymore.
There is some (not much) literature on the subject; if you are
interested I can send you some links. And of course we would gladly
welcome any contributions in this area!
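
To make the difference between the two methods concrete, here is a rough
sketch in Python (not Nutch's actual code). An MD5 digest of the raw text
changes whenever a counter or date changes, while a profile built from
quantized word frequencies ignores such one-off tokens. The function names,
the regular expression, and the min_len and quant parameters are all
assumptions made for this illustration.

```python
import hashlib
import re
from collections import Counter


def md5_signature(text: str) -> str:
    """Exact-match signature: any changed character yields a new digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def profile_signature(text: str, min_len: int = 2, quant: int = 2) -> str:
    """Hypothetical 'text profile' signature.

    Keeps only alphabetic tokens, rounds each token frequency down to a
    multiple of `quant`, drops tokens whose rounded frequency is zero,
    and hashes the sorted (token, frequency) list.
    """
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= min_len]
    profile = []
    for token, count in Counter(tokens).items():
        rounded = (count // quant) * quant
        if rounded > 0:
            profile.append((token, rounded))
    # Sort by descending frequency, then alphabetically, for a stable order.
    profile.sort(key=lambda item: (-item[1], item[0]))
    joined = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

Two copies of a page that differ only in a visitor counter and a date get
different MD5 signatures but the same profile signature. The price is
false positives: genuinely different pages that share their most frequent
words collapse to the same profile, which is part of why the problem is
hard.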
> engine based on nutch. Another "must have" is variable refetch time
> for pages (this could be based on estimating average update time of
> the page + taking into account page score)
>
This is more or less ready to be committed. As discussed earlier on
nutch-dev, this is a significant change, so I'm holding the commit until
after the release.
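
For readers unfamiliar with the idea, the suggestion quoted above could be
sketched roughly as follows. This is only an illustration of the general
technique, not the pending patch: the function names, the rates, the
bounds, and the assumption that the score lies in [0, 1] are all
hypothetical.

```python
DAY = 86400.0  # seconds


def update_interval(interval: float, changed: bool,
                    inc_rate: float = 0.4, dec_rate: float = 0.2,
                    min_interval: float = DAY / 24,
                    max_interval: float = 30 * DAY) -> float:
    """Adjust the estimated refetch interval after a fetch.

    If the page changed since the last fetch, shrink the interval;
    if it did not, grow it. The result is clamped to the given bounds.
    """
    factor = (1.0 - dec_rate) if changed else (1.0 + inc_rate)
    return min(max(interval * factor, min_interval), max_interval)


def scheduled_delay(interval: float, score: float,
                    score_weight: float = 0.5) -> float:
    """Shorten the delay for higher-scoring pages (score assumed in [0, 1])."""
    return interval / (1.0 + score_weight * score)
```

Over repeated fetches the interval drifts toward the page's observed
update period, and the score only scales the final delay, so it does not
compound from one fetch to the next.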
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com