Thread: KinoSearch & Similar/Duplicate Documents




KinoSearch & Similar/Duplicate Documents
2008-02-25 05:00:43
Hello !

I love to use KinoSearch. So far it's doing everything we need for our
project. I wonder if you could suggest me a way how to retrieve
Similar documents and Duplicates. We index few web-sites and sometimes
the documents are posted with different URLs. How to solve this?

One of the issues we also have is not related to KinoSearch. We would
like to remove some parts of the page which are similar (let's say we
want to remove navigation menu shared on all pages). Remove the
content is quite easy, but how would you detect what parts are
repeated across pages? Diff algorithm? What kind of approach would you
suggest?

Thank you,
Vlad

_______________________________________________
KinoSearch mailing list
KinoSearchrectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch


Re: KinoSearch & Similar/Duplicate Documents
United States
2008-02-25 08:49:23
On Feb 25, 2008, at 3:00 AM, Vladimir Vlach wrote:

> I wonder if you could suggest me a way how to retrieve
> Similar documents and Duplicates. We index few web-sites and sometimes
> the documents are posted with different URLs. How to solve this?

Off the top of my head, I don't know of an easy or reliable approach.
I'm sure that there is academic research out there on the subject.

Brainstorming...

This is a two-stage problem.  The hard part is identifying candidates
which may be similar to each other.  After you have candidates, then
you can roll through the seemingly matching docs and see what kind of
matching content is really there.  Is it boilerplate template code
(e.g. nav menus) that ought to be discarded?  Or is this truly
meaningful content which has been duplicated in multiple locations?

Say you were to build a pure vector space search engine, as described
at <http://www.perl.com/pub/a/2003/02/19/engine.html>.  Then you
perform a search using the entire contents of one document as a
query.  Documents with duplicate content will appear nearly on top of
each other in vector space.
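
A minimal sketch of that idea, in Python rather than Perl and not taken
from the linked article: each document becomes a raw term-frequency
vector, and one whole document is used as the query against the rest.
The function names, the tokenizer, and the 0.9 threshold are all
assumptions made for illustration; a real engine would add term
weighting and an index instead of a linear scan.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a document to a sparse term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two sparse vectors (near 1.0 = same direction)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_candidates(docs, query_id, threshold=0.9):
    """Use one whole document as the query; return (id, score) pairs
    for every other document scoring at or above the threshold."""
    query = term_vector(docs[query_id])
    scored = []
    for doc_id, text in docs.items():
        if doc_id == query_id:
            continue
        score = cosine(query, term_vector(text))
        if score >= threshold:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Documents that differ only in punctuation or case score close to 1.0
here, while documents sharing no terms score 0.0; anything in between
is a candidate for the second, closer-inspection stage.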

An uncompressed vector space search engine is not feasible for large
document collections; however, I suspect that a decomposed vector
engine a la LSA (latent semantic analysis) would do a good job at
picking candidates.  An excellent introduction to LSA is available at
<http://www.knowledgesearch.org/lsi/cover_page.htm>.  (I've started
collecting these links on a wiki page at
<http://www.rectangular.com/kinosearch/wiki/VectorSpaceModel>.)

The patent on Latent Semantic Analysis expires this year.  It ought to
be possible to extend KinoSearch with a KSx::LSA distro, which would
include KSx::LSA::LSAWriter, KSx::LSA::LSAQuery and so on.

> One of the issues we also have is not related to KinoSearch. We would
> like to remove some parts of the page which are similar (let's say we
> want to remove navigation menu shared on all pages). Remove the
> content is quite easy, but how would you detect what parts are
> repeated across pages? Diff algorithm? What kind of approach would you
> suggest?

I haven't studied this one in depth; from what I understand it's quite
a difficult problem.  (I vaguely recall a discussion in some Lucene
forum where Andrzej Bialecki, one of Lucene's biggest contributors,
threw up his hands.)  Especially annoying is template code which
varies subtly, making verification of suspected boilerplate a
challenging prospect.  I can think of some vector-based techniques I
might try, but hunting down academic research on the topic is likely
to be more fruitful.

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: KinoSearch & Similar/Duplicate Documents
United States
2008-02-25 13:56:20

Vladimir Vlach wrote on 2/25/08 5:00 AM:
> Hello !
> 
> I love to use KinoSearch. So far it's doing everything we need for our
> project. I wonder if you could suggest me a way how to retrieve
> Similar documents and Duplicates. We index few web-sites and sometimes
> the documents are posted with different URLs. How to solve this?
> 

Duplicates can be identified simply by MD5-ing the doc content. That's
what Swish-e's spider.pl does.
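
As a rough illustration of the MD5 idea (a Python sketch, not
spider.pl's actual code), the digest of each page body can be kept in a
table keyed by content, so a second URL with the same body is caught
before indexing. The function names are hypothetical, and collapsing
whitespace before hashing is an added assumption, not something the
approach requires.

```python
import hashlib


def content_digest(text):
    """MD5 hex digest of the document body.

    Assumption: runs of whitespace are collapsed first so that trivial
    formatting differences do not hide an otherwise identical page.
    """
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(pages):
    """Keep the first URL seen for each distinct body.

    `pages` is an iterable of (url, text) pairs. Returns the unique
    pages and a list of (duplicate_url, original_url) pairs.
    """
    seen = {}  # digest -> first URL with that content
    unique, duplicates = [], []
    for url, text in pages:
        digest = content_digest(text)
        if digest in seen:
            duplicates.append((url, seen[digest]))
        else:
            seen[digest] = url
            unique.append((url, text))
    return unique, duplicates
```

This only catches byte-for-byte (after normalization) duplicates; a
single changed character gives a completely different digest.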

Similarity is a much tougher nut. LSA is a decent approach (as Marvin
suggested). One Swish-e user tried this:

http://swish-e.org/archive/2005-02/8967.html

The key imo is to avoid indexing duplicate and
for-some-value-of-similar documents in the first place. Implement
these features at the document aggregator level, before handing them
to KS.


> One of the issues we also have is not related to KinoSearch. We would
> like to remove some parts of the page which are similar (let's say we
> want to remove navigation menu shared on all pages). Remove the
> content is quite easy, but how would you detect what parts are
> repeated across pages? Diff algorithm? What kind of approach would you
> suggest?

If you have control over the content, you might add <!-- noindex -->
tags around the stuff you want excluded, and then s/// that out before
you pass to KS.
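
For illustration, the same substitution sketched in Python instead of a
Perl s///. The closing marker <!-- /noindex --> is an assumed
convention, since only the opening tag is named above, and
strip_noindex is a hypothetical helper name.

```python
import re

# Assumed marker pair: <!-- noindex --> ... <!-- /noindex -->
# DOTALL lets the match span lines; ".*?" stops at the nearest closer.
NOINDEX = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html):
    """Remove every marked region before the page is handed to the indexer."""
    return NOINDEX.sub("", html)
```

The non-greedy match matters when a page has more than one marked
region: a greedy pattern would also delete the real content between
the first opener and the last closer.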

If you don't have control, and the improvement is worth your time,
consider identifying some text patterns in your documents and just
s/// those, as in the example above.
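
One hedged way to find such patterns without reading every page by
hand is to count how many pages each line of extracted text appears
in. This is a simple frequency heuristic offered as an illustration,
not a technique proposed in this thread; the function names and the
0.8 cutoff are assumptions. It only catches lines repeated exactly, so
template text that varies from page to page will slip through.

```python
from collections import Counter


def _normalize(line):
    """Collapse whitespace so spacing differences do not split a line."""
    return " ".join(line.split())


def find_boilerplate_lines(pages, min_fraction=0.8):
    """Return the lines that occur in at least `min_fraction` of the pages.

    `pages` is a list of page texts (markup already stripped).
    """
    page_count = len(pages)
    if page_count == 0:
        return set()
    seen_in = Counter()
    for text in pages:
        # A set, so a line repeated within one page is counted once.
        lines = {_normalize(line) for line in text.splitlines()}
        lines.discard("")
        seen_in.update(lines)
    return {line for line, n in seen_in.items() if n / page_count >= min_fraction}


def strip_boilerplate(text, boilerplate):
    """Drop every line of `text` that was flagged as boilerplate."""
    kept = [line for line in text.splitlines() if _normalize(line) not in boilerplate]
    return "\n".join(kept)
```

The flagged lines can be reviewed by eye and then turned into the
regexps used for the substitution step.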

-- 
Peter Karman  .  http://peknet.com/  .  peterpeknet.com



Re: KinoSearch & Similar/Duplicate Documents
2008-02-26 00:30:52
On 2/25/08, Vladimir Vlach <vladamangmail.com> wrote:
>  One of the issues we also have is not related to KinoSearch. We would
>  like to remove some parts of the page which are similar (let's say we
>  want to remove navigation menu shared on all pages). Remove the
>  content is quite easy, but how would you detect what parts are
>  repeated across pages? Diff algorithm? What kind of approach would you
>  suggest?

I recently was talking with a friend about how to do this for indexing
a blog aggregator.  For his case, a straight 'diff' type algorithm
wasn't going to work very well due to rotating ads and page-specific
navigation.  Peter's suggestions (custom regexps) make good sense if
you have control of the pages or have a set number of sites which you
are scraping.

Another approach would be to do the analysis at the DOM level rather
than the text level.  There's an HTML::ContentExtractor module that
might be a good starting point for this:
<http://search.cpan.org/~jzhang/HTML-ContentExtractor/lib/HTML/ContentExtractor.pm>
It does DOM parsing, and makes simple statistical guesses about what
is real content and what is junk based on the percentage of text to
tags.  With a better (or per-site customized) algorithm for
classification, I think this has potential.
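
To make the text-to-tags idea concrete, here is a small Python sketch
using the standard library's html.parser. It is not
HTML::ContentExtractor's algorithm; the block-tag list, the class and
function names, and the 25-characters-per-tag cutoff are assumptions
picked for illustration, and it makes no attempt to skip script or
style content.

```python
from html.parser import HTMLParser

# Assumed set of tags that start or end a chunk of the page.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "ol", "table", "article", "section"}


class BlockDensity(HTMLParser):
    """Split a page into chunks at block-level tags, recording for each
    chunk its text and the number of start tags seen while collecting it."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # list of (text, tag_count)
        self._text = []
        self._tags = 0

    def _flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._tags))
        self._text, self._tags = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        self._tags += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._text.append(data)

    def close(self):
        super().close()
        self._flush()


def likely_content(html, min_chars_per_tag=25):
    """Return the chunks whose ratio of text characters to tags is high."""
    parser = BlockDensity()
    parser.feed(html)
    parser.close()
    return [text for text, tags in parser.blocks
            if len(text) / max(tags, 1) >= min_chars_per_tag]
```

Link-heavy navigation lists come out with only a few characters per
tag and are dropped, while long paragraphs pass. The cutoff is the
part most likely to need per-site tuning.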

For my friend, it was possible that http://dapper.net was going to be
useful as well.  Dapper is a web service that lets you create
customized RSS feeds of sites based on graphically entered parameters.
Probably not going to work for your needs, but might be worth
checking out for ideas.

Good luck!

Nathan Kurz
nateverse.com


