Mladen Adamovic wrote:
> Hi!
>
> I want to get more insight into various search engine algorithms. I
> have wide knowledge of standard data structures & algorithms
> (hash values, trees, graphs, etc.). I thought that Lucene would be a
> good place to start looking for information, and indeed I've found
> some decent information at the Nutch website. However, I decided to
> post some personal opinions on this issue here, thinking that someone
> might give me even more information.
>
> As far as I understand I should read books about Information
> Retrieval (e.g. Modern Information Retrieval by Baeza-Yates,
> Ribeiro-Neto). Any update?
>
> Starting from one article about link spam and searching Citeseer, I
> also found a wide range of articles about link spam techniques,
> namely:
> 1. Undue Influence: Eliminating the Impact of Link Plagiarism on Web
>    Search Rankings
> 2. Using Rank Propagation and Probabilistic Counting for Link-Based
>    Spam Detection
> 3. SpamRank - Fully Automatic Link Spam Detection
> 4. Identifying Link Farm Spam Pages
> 5. Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam

Yes, good references. At this moment most of my working knowledge about
search engines comes either from the book you cited above, or from
papers found on Citeseer - play around with IR-related terms, you will
find a LOT of papers to read... ;) And then follow the references from
those papers...
I also found that other printed books are either too outdated or not
very relevant to web-scale IR.
In the end (as usual) the best way to really dig into the subject is
to try to solve a real-life problem, combining the tools you already
have with what you have learned.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com