Dennis Kubes wrote:
>> That's a very nice description - thanks, Dennis. I think it would be
>> useful to include it on the Wiki as a case study.
>
> I will polish it up a bit and put it out there.
Great, thanks.
>>> This is all dependent on the size of each local index. Approximately
>>> 2-4M pages per index split is good. Over that you may see
>>> performance decreases. Scaling that out over many servers you will
>>> see almost linear response time. We have almost 100M pages in the
>>> index and are seeing subsecond response times on most queries.
>>
>> Are you running with a sorted index, and using non-zero
>> searcher.max.hits? If you use a well-defined PR-like scoring, then
>> using this feature could do wonders for performance, and increase
>> the max number of docs per server.
>
> I don't know about the sorted index. How do I learn about that?
>
> We basically took the current indexer and extended it to split into
> parts. The indexer also splits the segments and linkdb into the same
> parts so all data for a single url will be in the same split on the same
> search server. We are using searcher.max.hits at 1000 and we did see a
> performance increase from that.
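
For readers following along, the co-location idea above (all data for one URL ends up in the same split) can be sketched roughly as follows. This is only an illustration in Python, not the extended indexer itself: it assumes the split is chosen by a stable hash of the URL, which the message does not confirm, and the names `split_for_url` and `partition` are hypothetical.

```python
import hashlib


def split_for_url(url: str, num_splits: int) -> int:
    """Pick a split from a stable hash of the URL (assumed scheme, not Nutch's)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the result is stable across runs.
    return int.from_bytes(digest[:8], "big") % num_splits


def partition(records, num_splits):
    """Group (url, kind, payload) records so each URL lands in exactly one split.

    `kind` stands in for the data source, e.g. "index", "segment" or "linkdb".
    """
    splits = [[] for _ in range(num_splits)]
    for url, kind, payload in records:
        splits[split_for_url(url, num_splits)].append((url, kind, payload))
    return splits
```

Because the split depends only on the URL, the index entry, segment data and linkdb entry for a page are routed to the same search server regardless of which dataset they come from.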
If you're using non-zero searcher.max.hits with un-sorted indexes, your
ranking will be broken, i.e. the code in LuceneQueryOptimizer will make
wrong assumptions about the extrapolation of scores for skipped
documents. This feature strongly relies on having indexes sorted by
PageRank score - see the IndexSorter tool for details. If you don't sort
the index by PageRank, you should set this property to <= 0.
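
To illustrate why the sort order matters, here is a toy model in Python. It is not the LuceneQueryOptimizer code, only a sketch under the assumption that the searcher walks the index in document order and stops once max_hits matching documents have been collected; the function name and the sample scores are made up.

```python
def search_with_max_hits(index, matches, max_hits, top_n):
    """Toy early-termination search.

    index    -- list of (doc_id, static_score) in index (document) order
    matches  -- predicate telling whether a doc_id matches the query
    max_hits -- stop after this many matching docs; <= 0 disables the limit
    top_n    -- number of best-scoring collected hits to return
    """
    collected = []
    for doc_id, score in index:
        if not matches(doc_id):
            continue
        collected.append((doc_id, score))
        if max_hits > 0 and len(collected) >= max_hits:
            break  # everything after this point is skipped unseen
    collected.sort(key=lambda hit: hit[1], reverse=True)
    return collected[:top_n]


def match_all(doc_id):
    """Stand-in query that matches every document."""
    return True


unsorted_index = [("a", 0.1), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
sorted_index = sorted(unsorted_index, key=lambda d: d[1], reverse=True)

# Unsorted index with a limit: "d" (0.7) is never seen, so the ranking is wrong.
print(search_with_max_hits(unsorted_index, match_all, 2, 2))
# Sorted index with the same limit: the first two docs are the true top two.
print(search_with_max_hits(sorted_index, match_all, 2, 2))
# Unsorted index with the limit disabled: correct, but every doc is scanned.
print(search_with_max_hits(unsorted_index, match_all, 0, 2))
```

In this model the skipped documents are only guaranteed to score lower than the collected ones when the index is ordered by descending static score, which is the assumption the limit depends on.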
Also try upgrading Nutch to Lucene 2.2.0; this alone should give you a
performance boost of a few percent (if Lucene is indeed the bottleneck).
See also my (long) rant about the complexity of Nutch queries:
http://www.nabble.com/Performance-optimization-for-Nutch-index---query-tf3276316.html#a9111523
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||/| Information Retrieval, Semantic Web
___|||__|| | || | Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com