Thread: stemming




stemming
user name
2006-06-26 13:22:12
Hi, Jerome

I think that the best way is to ask Eugen to share his code. I hope
he will comply with our request...
I want to believe that his answer will be positive! If not, then I
will share my "BAD code" with you.

-----------

Regards

Alexey


na> I don't know.
na> Could you please send me off list your code.
na>
na>
na> Jerome

stemming
user name
2006-06-26 19:17:45
Alexey,

Sorry for the delay in answering you. I will definitely share my code
with the Nutch community, but currently I'm on vacation, away from my
sources, so I will share them as soon as my vacation ends.


P.S. Nutch is great, and I hope that my efforts will help to make it
better.

P.P.S Why not develop an efficient technique to fight near-duplicates
and SE spam? This is absolutely necessary if you build an Internet
search engine based on Nutch. Another "must have" is a variable refetch
time for pages (this could be based on estimating the average update
time of the page and taking the page score into account).


> Hi, Jerome

> I think that the best way is to ask Eugen to share his code. I hope
> he will comply with our request...
> I want to believe that his answer will be positive! If not, then I
> will share my "BAD code" with you.

> -----------

> Regards

> Alexey

-- 
Best regards,
 Eugen                            mailto:eugenlan23.net

stemming
user name
2006-06-28 09:18:46
Hi, Eugen

I think that is the right way.

-----------

Regards,
Alexey



> P.P.S Why not develop an efficient technique to fight near-duplicates
> and SE spam? This is absolutely necessary if you build an Internet
> search engine based on Nutch. Another "must have" is a variable refetch
> time for pages (this could be based on estimating the average update
> time of the page and taking the page score into account).



> --
> Best regards,
>  Eugen                            mailto:eugenlan23.net

stemming
user name
2006-06-28 10:15:23
Eugen Kochuev wrote:
> P.P.S Why not develop an efficient technique to fight near-duplicates
> and SE spam? This is absolutely necessary if you build an Internet search
>   

Why not, indeed? ;) The answer is that it is very difficult. There are
simple methods that Nutch uses (MD5 and "text profile"), but generally
speaking it is a difficult task. If you consider that pages may contain
elements that change daily (such as the date) or even with every
request (ads, counters, banners, current time), or depending on the
context (first request, subsequent requests), or may be composed from
reusable parts (portlets), the problem doesn't seem so trivial anymore.

There is some (not much) literature on the subject. If you are
interested, I can send you some links, and of course we would gladly
welcome any contributions in this area!
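
To make the difficulty concrete, here is a minimal Python sketch. It is
not Nutch's code: the function names, the tokenizer, and the
quantization parameters are all assumptions chosen for illustration. It
contrasts an exact MD5 digest, which changes when a date or counter on
the page changes, with a rough "text profile" style signature built
from quantized term frequencies, which ignores rare tokens.

```python
import hashlib
import re
from collections import Counter


def exact_signature(text: str) -> str:
    """MD5 over the raw text: any changed character gives a new digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def profile_signature(text: str, min_len: int = 2, quant_rate: float = 0.01) -> str:
    """Hypothetical fuzzy signature: MD5 over quantized term frequencies.

    min_len and quant_rate are illustrative defaults, not Nutch settings.
    """
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= min_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    # Quantization step: at least 2, so terms seen only once are dropped.
    quant = max(2, round(max(counts.values()) * quant_rate))
    # Round each count down to a multiple of quant and discard small ones.
    profile = {t: (c // quant) * quant for t, c in counts.items()}
    profile = {t: c for t, c in profile.items() if c >= quant}
    # Serialize in a stable order: most frequent first, then alphabetical.
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    serialized = "\n".join(f"{term} {count}" for term, count in ordered)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


# Same body text, but a different date and visitor counter in the footer.
body = "nutch search engine crawls pages and nutch indexes pages for search " * 3
page_a = body + "Updated 2006-06-26 visitor 1041"
page_b = body + "Updated 2006-06-28 visitor 1077"

print(exact_signature(page_a) == exact_signature(page_b))
print(profile_signature(page_a) == profile_signature(page_b))
```

A signature like this only tolerates small, low-frequency changes. It
does nothing for rotating ads with a lot of text, context-dependent
responses, or pages assembled from reusable parts, which is where the
problem stops being trivial.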

> engine based on Nutch. Another "must have" is a variable refetch time
> for pages (this could be based on estimating the average update time of
> the page and taking the page score into account)
>   

This is more or less ready to be committed. As was discussed earlier
on nutch-dev, since this is a significant change I'm waiting with the
commit until after the release.
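
For readers new to the idea, the quoted suggestion (estimate how often
a page changes, then take its score into account) can be sketched
roughly as follows in Python. This is only an illustration, not the
change waiting to be committed: the function names, the rates, the
interval bounds, and the way the score shortens the interval are all
assumptions.

```python
DAY = 24 * 3600.0  # seconds


def clamp(value: float, low: float, high: float) -> float:
    """Keep value inside [low, high]."""
    return max(low, min(high, value))


def update_estimate(interval: float, changed: bool, *,
                    dec_rate: float = 0.5, inc_rate: float = 0.5,
                    min_interval: float = 1 * DAY,
                    max_interval: float = 90 * DAY) -> float:
    """Adjust the estimated update interval after one fetch.

    If the page changed since the last fetch, assume it changes more
    often than estimated and shrink the interval; otherwise grow it.
    All rates and bounds are hypothetical defaults.
    """
    if changed:
        interval *= (1.0 - dec_rate)
    else:
        interval *= (1.0 + inc_rate)
    return clamp(interval, min_interval, max_interval)


def scheduled_interval(estimate: float, score: float, *,
                       min_interval: float = 1 * DAY,
                       max_interval: float = 90 * DAY) -> float:
    """Shorten the wait for high-scoring pages (an assumed policy).

    Scores at or below 1.0 leave the estimate unchanged.
    """
    return clamp(estimate / max(1.0, score), min_interval, max_interval)


# A page first assumed to change every 30 days.
estimate = 30 * DAY
estimate = update_estimate(estimate, changed=True)   # changed: fetch sooner
estimate = update_estimate(estimate, changed=False)  # unchanged: back off
print(estimate / DAY, scheduled_interval(estimate, score=4.0) / DAY)
```

Keeping the change estimate separate from the score adjustment means a
temporary jump in score does not distort what has been learned about
how often the page actually changes.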

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com 
Contact: info at sigram dot com

