On Feb 25, 2008, at 3:00 AM, Vladimir Vlach wrote:
> I wonder if you could suggest me a way how to retrieve
> Similar documents and Duplicates. We index few
web-sites and sometimes
> the documents are posted with different URLs. How to
solve this?
Off the top of my head, I don't know of an easy or reliable
approach.
I'm sure that there is academic research out there on the
subject.
Brainstorming...
This is a two-stage problem. The hard part is identifying
candidates
which may be similar to each other. After you have
candidates, then
you can roll through the seemingly matching docs and see
what kind of
matching content is really there. Is it boilerplate
template code
(e.g. nav menus) that ought to be discarded? Or is this
truly
meaningful content which has been duplicated in multiple
locations?
Say you were to build a pure vector space search engine, as
described
at <http://www.perl.com/pub/a/2003/02/19/engine.html>.
Then you
perform a search using the entire contents of one document
as a
query. Documents with duplicate content will appear nearly
on top of
each other in vector space.
An uncompressed vector space search engine is not feasible
for large
document collections; however, I suspect that a decomposed
vector
engine a la LSA (latent semantic analysis) would do a good
job at
picking candidates. An excellent introduction to LSA is
available at <htt
p://www.knowledgesearch.org/lsi/cover_page.htm
>. (I've started collecting these links on a wiki page
at <http://www.rectangular.com/kinosearch/wiki/VectorSpac
eModel
>.)
The patent on Latent Semantic Analysis expires this year.
It ought to
be possible to extend KinoSearch with a KSx::LSA distro,
which would
include KSx::LSA::LSAWriter, KSx::LSA::LSAQuery and so on.
> One of the issues we also have is not related to
KinoSearch. We would
> like to remove some parts of the page which are similar
(let's say we
> want to remove navigation menu shared on all pages).
Remove the
> content is quite easy, but how would you detect what
parts are
> repeated across pages? Diff algorithm? What kind of
approach would you
> suggest?
I haven't studied this one in depth; from what I understand
it's quite
a difficult problem. (I vaguely recall a discussion in some
Lucene
forum where Andrzej Bialecki, one of Lucene's biggest
contributors,
threw up his hands.) Especially annoying is template code
which
varies subtly, making verification of suspected boilerplate
a
challenging prospect. I can think of some vector-based
techniques I
might try, but hunting down academic research on the topic
is likely
to be more fruitful.
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|