[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12453682 ]
Dogacan Güney commented on NUTCH-92:
------------------------------------
Here is my second attempt at this. DistributedSearch$Client
now keeps a mapping from server addresses to numDocs and, in
search(), computes the total number of documents from the
live servers.
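
A minimal sketch of that bookkeeping, for illustration only. The class
LiveDocCounter and its methods are hypothetical names and are not taken
from the attached patch; the sketch only assumes that each server
reports its document count once and that the client knows which
servers are currently live.

```java
import java.net.InetSocketAddress;
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper; not the code from the attached patch. */
final class LiveDocCounter {
    // Document count last reported by each search server, keyed by address.
    private final Map<InetSocketAddress, Integer> numDocsByAddress =
        new ConcurrentHashMap<InetSocketAddress, Integer>();

    /** Record the document count a server reported. */
    void update(InetSocketAddress address, int numDocs) {
        numDocsByAddress.put(address, Integer.valueOf(numDocs));
    }

    /** Sum the document counts of the given live servers; unknown servers count as 0. */
    long totalDocs(Collection<InetSocketAddress> liveServers) {
        long total = 0;
        for (InetSocketAddress address : liveServers) {
            Integer numDocs = numDocsByAddress.get(address);
            if (numDocs != null) {
                total += numDocs.intValue();
            }
        }
        return total;
    }
}
```

Summing only over live servers keeps the total consistent with the
indexes that can actually contribute hits to the current query.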
> DistributedSearch incorrectly scores results
> --------------------------------------------
>
> Key: NUTCH-92
> URL: http://issues.apache.org/jira/browse/NUTCH-92
> Project: Nutch
> Issue Type: Bug
> Components: searcher
> Affects Versions: 0.7, 0.8
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Attachments: distributed-idf-v2.patch, distributed-idf.patch
>
>
> When running search servers in a distributed setup,
using DistributedSearch$Server and Client, total scores are
calculated incorrectly. The symptom is that scores depend
on how segments are deployed to Servers: if terms are
unevenly distributed across segment indexes (due to segment
size or content differences), scores will differ depending
on how many and which segments are deployed on a particular
Server. This may cause less relevant results to be ranked
above more relevant ones.
> The underlying reason for this is that each
IndexSearcher (which uses local index on each Server)
calculates scores based on the local IDFs of query terms,
and not the global IDFs from all indexes together. This
means that scores arriving from different Servers to the
Client cannot be meaningfully compared, unless all indexes
have similar distribution of Terms and similar numbers of
documents in them. However, the Client currently mixes all
scores together, sorts them by absolute value and picks the
top hits. These absolute values will change if segments are
unevenly deployed to Servers.
> Currently the workaround is to deploy the same number
of documents in segments per Server, and to ensure that
segments contain well-randomized content so that term
frequencies for common terms are very similar.
> The solution proposed here (as a result of discussion
between ab and cutting, patches are coming) is to calculate
global IDFs prior to running the query, and pre-boost query
Terms with these global IDFs. This will require one more RPC
call per query (this can be optimized later, e.g.
through caching). Then the scores will become normalized
according to the global IDFs, and Client will be able to
meaningfully compare them. Scores will also become
independent of the segment content or local number of
documents per Server. This will involve at least the
following changes:
> * change NutchSimilarity.idf(Term, Searcher) to always
return 1.0f. This enables us to manipulate scores
independently of local IDFs.
> * add a new method to Searcher interface, int[]
getDocFreqs(Term[]), which will return document frequencies
for query terms.
> * modify getSegmentNames() so that it also returns the
total number of documents in each segment, or implement this
as a separate method (this will be called once during
segment init).
> * in DistributedSearch$Client.search() first make a
call to servers to return local document frequencies for the
current query, and calculate global IDFs for each relevant
Term in that query.
> * multiply the TermQuery boosts by idf(totalDocFreq,
totalIndexedDocs), and PhraseQuery boosts by the sum of
idf(totalDocFreq, totalIndexedDocs) over all of its terms.
> This solution should be applicable with only minor
changes to all branches, but initially the patches will be
relative to trunk/ .
> Comments, suggestions and review are welcome!
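
The global-IDF computation described in the issue can be sketched
roughly as follows. This is an illustration under stated assumptions,
not the attached patch: the class GlobalIdf and all of its method names
are hypothetical, and idf() assumes the formula
log(numDocs / (docFreq + 1)) + 1 used by Lucene's DefaultSimilarity.
Each row of docFreqsPerServer stands for one server's answer to the
proposed getDocFreqs(Term[]) call.

```java
/** Hypothetical helper illustrating the proposed global-IDF boosts. */
final class GlobalIdf {
    /** Assumed IDF formula: log(numDocs / (docFreq + 1)) + 1. */
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Sum per-server document frequencies, one slot per query term. */
    static long[] sumDocFreqs(int[][] docFreqsPerServer, int numTerms) {
        long[] totals = new long[numTerms];
        for (int[] serverFreqs : docFreqsPerServer) {
            for (int t = 0; t < numTerms; t++) {
                totals[t] += serverFreqs[t];
            }
        }
        return totals;
    }

    /** Global IDF per term, usable as a TermQuery boost under the proposal. */
    static float[] termBoosts(int[][] docFreqsPerServer, int[] numDocsPerServer, int numTerms) {
        long totalIndexedDocs = 0;
        for (int numDocs : numDocsPerServer) {
            totalIndexedDocs += numDocs;
        }
        long[] totalDocFreqs = sumDocFreqs(docFreqsPerServer, numTerms);
        float[] boosts = new float[numTerms];
        for (int t = 0; t < numTerms; t++) {
            boosts[t] = idf(totalDocFreqs[t], totalIndexedDocs);
        }
        return boosts;
    }

    /** PhraseQuery boost: the sum of the global IDFs of the phrase's terms. */
    static float phraseBoost(float[] termBoosts) {
        float sum = 0f;
        for (float boost : termBoosts) {
            sum += boost;
        }
        return sum;
    }
}
```

As an example, take one term that occurs in 9 of 1,000 documents on
one server and in 9,999 of 100,000 documents on another. Under the
assumed formula the local IDFs are roughly 5.6 and 3.3, so the same
term is weighted very differently on the two servers. Calling
termBoosts(new int[][] {{9}, {9999}}, new int[] {1000, 100000}, 1)
yields a single global value of roughly 3.3 that both servers would
apply.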
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira