List Info

Thread: Re: Multiple Indices vs Single Index




Re: Multiple Indices vs Single Index
country flaguser name
United States
2007-09-21 00:37:04
Thanks Grant and Chris for the replies.

I am looking at a single index because the 40 index system
has started having performance issues at high load.  My
daily traffic is increasing at a steady pace and about 40%
of the traffic is concentrated in a 2 hour period and
searches start slowing down just a little bit.

All my indices have exactly the same fields, so I guess
fieldcache and sorting are not an issue.  I do not do any
updates, so I just open 40 searchers and store them in a
HashMap.  When a query comes, I select the appropriate
searcher and fire the query.  So code maintainance also is
not an issue.

The different IDF statistics are a valid point.  I have not
analysed the difference in the results returned - I sort of
assumed the results will be same in both cases.  I will look
at the difference in results.


There is one more point.  Sooner or later, I will have to
move to multiple servers due to increasing traffic.  (I will
be handling 30,000+ hits per hour by the end of the year) 
One option is to have full index on all servers and
load-balance them.  Another is to have half the indices on
one server and half of them on the other.  The front-end
(separate server) will then fire the query on the
appropriate server.  Any suggestions on which one would be a
better choice ?  All data on all servers will give me
redundancy, the system will be up even if one server goes
down.  Also, adding more servers would be trivial.


Thanks,
Nikhil


----- Original Message ----
From: Chris Hostetter <hossman_lucenefucit.org>
To: java-userlucene.apache.org
Sent: Friday, 21 September, 2007 4:02:38 AM
Subject: Re: Multiple Indices vs Single Index


: I was wondering if it will be better to just have 1 large
index with all 
: the 40 indices combined.  I do not need to do dual-queries
and my total 
: index size (if I create a single index) is about 3.4GB. 
It will 
: increase to maximum of 5-6 GB.  I am running this on a
dedicated machine 
: with 8GB RAM.

off the top of my head, there are 3 main reasons i can think
of that would 
motivate one choice over another -- ultimately it's up to
you...

1) FieldCache and sorting ... if all 40 sets of of Documents
contain 
have consistently named fields, then there won't be much
difference 
between 40 indexes and 1 index ... but if each of those 40
sets contain 
documents with radically differnet fields -- and you want to
sort on N 
differnet fields for each sets -- then the total FieldCache
sizes for each 
of those 40 indexes will be smaller then the FieldCaches for
one gian 
index (because every document will get an entry wethe it
makes sense or 
not.

2) idf statistics.  if you have common fields you search
regardless of 
document set, the 40 index approach will maintain seperate
sttistics -- 
this may be important if some terms are very common in only
som docsets.  
the word "albino" may be really common in docset A
but only one doc in 
docset B has it ... in the 40 index appraoch querying B for
(albino 
elephant) will give a lot of weight to albino because it's
so rare, but in 
the single index case albino may not be considered as
significant because 
of ht unified idf value for all docsets 9even if hte query
is constrained 
to docset B) ... again: this only matters if the fields
overlap, if every 
docset has a unique set of fields then the idfs will be
unique because 
they are by field)

3) management: it's probably a lot simpler to maintain and
manage code 
that deals with one index then code that deals with 40
indexes. you milage 
may vary.


-Hoss


------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org





------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org


RE: Multiple Indices vs Single Index
country flaguser name
United Kingdom
2007-09-21 05:49:29
In a similar scenario, if we had more than 40 of such
grouping, on which the
hits (not lucene hits, but hit for search by users) are not
evenly
distributed, would it be good to have a LRU caching
mechanism for searcher
objects.

Is there any sort of grey are that should be avoided in
doing that?


-----Original Message-----
From: Nikhil Chhaochharia [mailto:nikhil500yahoo.com] 
Sent: 21 September 2007 06:37
To: java-userlucene.apache.org
Subject: Re: Multiple Indices vs Single Index

Thanks Grant and Chris for the replies.

I am looking at a single index because the 40 index system
has started
having performance issues at high load.  My daily traffic is
increasing at a
steady pace and about 40% of the traffic is concentrated in
a 2 hour period
and searches start slowing down just a little bit.

All my indices have exactly the same fields, so I guess
fieldcache and
sorting are not an issue.  I do not do any updates, so I
just open 40
searchers and store them in a HashMap.  When a query comes,
I select the
appropriate searcher and fire the query.  So code
maintainance also is not
an issue.

The different IDF statistics are a valid point.  I have not
analysed the
difference in the results returned - I sort of assumed the
results will be
same in both cases.  I will look at the difference in
results.


There is one more point.  Sooner or later, I will have to
move to multiple
servers due to increasing traffic.  (I will be handling
30,000+ hits per
hour by the end of the year)  One option is to have full
index on all
servers and load-balance them.  Another is to have half the
indices on one
server and half of them on the other.  The front-end
(separate server) will
then fire the query on the appropriate server.  Any
suggestions on which one
would be a better choice ?  All data on all servers will
give me redundancy,
the system will be up even if one server goes down.  Also,
adding more
servers would be trivial.


Thanks,
Nikhil


----- Original Message ----
From: Chris Hostetter <hossman_lucenefucit.org>
To: java-userlucene.apache.org
Sent: Friday, 21 September, 2007 4:02:38 AM
Subject: Re: Multiple Indices vs Single Index


: I was wondering if it will be better to just have 1 large
index with all 
: the 40 indices combined.  I do not need to do dual-queries
and my total 
: index size (if I create a single index) is about 3.4GB. 
It will 
: increase to maximum of 5-6 GB.  I am running this on a
dedicated machine 
: with 8GB RAM.

off the top of my head, there are 3 main reasons i can think
of that would 
motivate one choice over another -- ultimately it's up to
you...

1) FieldCache and sorting ... if all 40 sets of of Documents
contain 
have consistently named fields, then there won't be much
difference 
between 40 indexes and 1 index ... but if each of those 40
sets contain 
documents with radically differnet fields -- and you want to
sort on N 
differnet fields for each sets -- then the total FieldCache
sizes for each 
of those 40 indexes will be smaller then the FieldCaches for
one gian 
index (because every document will get an entry wethe it
makes sense or 
not.

2) idf statistics.  if you have common fields you search
regardless of 
document set, the 40 index approach will maintain seperate
sttistics -- 
this may be important if some terms are very common in only
som docsets.  
the word "albino" may be really common in docset A
but only one doc in 
docset B has it ... in the 40 index appraoch querying B for
(albino 
elephant) will give a lot of weight to albino because it's
so rare, but in 
the single index case albino may not be considered as
significant because 
of ht unified idf value for all docsets 9even if hte query
is constrained 
to docset B) ... again: this only matters if the fields
overlap, if every 
docset has a unique set of fields then the idfs will be
unique because 
they are by field)

3) management: it's probably a lot simpler to maintain and
manage code 
that deals with one index then code that deals with 40
indexes. you milage 
may vary.


-Hoss


------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org





------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org


------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org


[1-2]

about | contact  Other archives ( Real Estate discussion Medical topics )