Thread: Re: Distributed index




Re: Distributed index
United States
2007-06-21 10:31:20

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
> 
>> 100 million pages = 50-100 servers and 20-40T of space distributed.
>> Ideally the setup would be processing machines and search servers.  You
> 
> [..]
> 
> That's a very nice description - thanks, Dennis. I think it would be
> useful to include it on the Wiki as a case study.

I will polish it up a bit and put it out there.

> 
> 
>> This is all dependent on the size of each local index.  Approximately
>> 2-4M pages per index split is good.  Over that you may see performance
>> decreases.  Scaling that out over many servers you will see almost
>> linear response time.  We have almost 100M pages in the index and are
>> seeing subsecond response times on most queries.
> 
> Are you running with a sorted index, and using non-zero
> searcher.max.hits? If you use a well-defined PR-like scoring, then using
> this feature could work wonders for performance, and increase the max
> number of docs per server.

I don't know about the sorted index.  How do I learn about that?

We basically took the current indexer and extended it to split into
parts.  The indexer also splits the segments and linkdb into the same
parts so all data for a single url will be in the same split on the same
search server.  We are using searcher.max.hits at 1000 and we did see a
performance increase from that.
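
For readers following along, here is a minimal sketch of the split sizing and per-URL routing described above. It is not the actual indexer extension: the message does not say how a URL is assigned to a split, so the hash-based routing is an assumption, and the function names `splits_needed` and `split_for_url` are hypothetical.

```python
import hashlib
import math


def splits_needed(total_pages, pages_per_split):
    """Number of index splits needed to hold total_pages at a given split size."""
    return math.ceil(total_pages / pages_per_split)


def split_for_url(url, num_splits):
    """Map a URL to a split number in [0, num_splits).

    Assumption: a stable hash of the URL picks the split. Using the same
    function for the index, the segments and the linkdb keeps all data
    for one URL in the same split.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_splits


# 100M pages at 2-4M pages per split gives a range of 25 to 50 splits.
fewest = splits_needed(100_000_000, 4_000_000)
most = splits_needed(100_000_000, 2_000_000)

# The same URL always routes to the same split.
part = split_for_url("http://example.com/page.html", fewest)
```

A stable digest is used rather than Python's built-in `hash()`, which is randomized per process for strings and so would route the same URL differently between runs.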

Dennis Kubes

> 
> 

Re: Distributed index
Poland
2007-06-21 12:59:21
Dennis Kubes wrote:

>> That's a very nice description - thanks, Dennis. I think it would be
>> useful to include it on the Wiki as a case study.
> 
> I will polish it up a bit and put it out there.


Great, thanks.


>>> This is all dependent on the size of each local index.  Approximately
>>> 2-4M pages per index split is good.  Over that you may see
>>> performance decreases.  Scaling that out over many servers you will
>>> see almost linear response time.  We have almost 100M pages in the
>>> index and are seeing subsecond response times on most queries.
>>
>> Are you running with a sorted index, and using non-zero
>> searcher.max.hits? If you use a well-defined PR-like scoring, then
>> using this feature could work wonders for performance, and increase
>> the max number of docs per server.
> 
> I don't know about the sorted index.  How do I learn about that?
>
> We basically took the current indexer and extended it to split into
> parts.  The indexer also splits the segments and linkdb into the same
> parts so all data for a single url will be in the same split on the same
> search server.  We are using searcher.max.hits at 1000 and we did see a
> performance increase from that.

If you're using a non-zero searcher.max.hits with unsorted indexes, your
ranking will be broken: the code in LuceneQueryOptimizer will make wrong
assumptions when extrapolating scores for skipped documents. This
feature relies strongly on having indexes sorted by PageRank score - see
the IndexSorter tool for details. If you don't sort the index by
PageRank, you should set this property to <= 0.
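
To illustrate why the ordering matters, here is a toy model of early termination. It is only a sketch, not the LuceneQueryOptimizer code: it assumes ranking is driven purely by a static PageRank-like score, and the `search` function and the sample scores are made up for the example.

```python
def search(index, matches, max_hits, top_n):
    """Scan docs in index order and return the ids of the top_n by score.

    index    -- list of (doc_id, static_score) pairs in on-disk order
    matches  -- predicate telling whether a doc matches the query
    max_hits -- stop collecting after this many matches; <= 0 means no limit
    """
    collected = []
    for doc_id, score in index:
        if matches(doc_id):
            collected.append((doc_id, score))
            if max_hits > 0 and len(collected) >= max_hits:
                break  # every later document is skipped unseen
    collected.sort(key=lambda d: d[1], reverse=True)
    return [doc_id for doc_id, _ in collected[:top_n]]


def match_all(doc_id):
    return True


# Hypothetical index in arbitrary (unsorted) order, then sorted by score.
unsorted_index = [(0, 0.1), (1, 0.9), (2, 0.3), (3, 0.8), (4, 0.2), (5, 0.7)]
sorted_index = sorted(unsorted_index, key=lambda d: d[1], reverse=True)

exact = search(unsorted_index, match_all, max_hits=0, top_n=2)
fast_sorted = search(sorted_index, match_all, max_hits=3, top_n=2)
fast_unsorted = search(unsorted_index, match_all, max_hits=3, top_n=2)
```

With the sorted index the limited scan returns the same top two documents as the exhaustive one, because anything skipped can only score lower. With the unsorted index the scan stops before reaching document 3, so a high-scoring hit silently drops out of the results.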

Also try upgrading Nutch to Lucene 2.2.0; this alone should give you a
performance boost of a few percent (if Lucene is indeed the bottleneck).

See also my (long) rant about the complexity of Nutch queries:
http://www.nabble.com/Performance-optimization-for-Nutch-index---query-tf3276316.html#a9111523


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||/|  Information Retrieval, Semantic Web
___|||__||  |  ||  |  Embedded Unix, System Integration
http://www.sigram.com 
Contact: info at sigram dot com

