
Thread: Lucene search optimization




Lucene search optimization
user name
2006-05-30 15:12:07
Hi,

I have 2 million documents, each with a name property (~15 to 20
characters). Fuzzy searching against this property takes around 3 seconds,
which is way too much for what I plan to do, so I am considering possible
optimizations. I can add a property to each document that would partition
the document space into 400 spaces. Each space would then be limited to
5000 documents, which should be small enough to make the fuzzy search
faster.

However, my question is: how do I take advantage of this additional
property? With a traditional RDBMS, I would add an index on the field, but
with Lucene I'm not sure how to proceed. Would filters be the way to go?
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Filter.html)
Could a CachingWrapperFilter help even more?
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/CachingWrapperFilter.html)

Additionally, the additional property is an id, so can I store it as a
number so that it is faster (I guess) than string comparison?

Thanks a lot,
Sami Dalouche
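
The partitioning idea in this message can be sketched in plain Java, independent of Lucene: names are grouped by an integer partition id so that an expensive fuzzy comparison only has to visit one group. The class `PartitionedNames` and its methods are hypothetical names used for illustration. This is not Lucene's Filter API, and it does not model how a Lucene fuzzy query enumerates terms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: names grouped by an integer partition id.
final class PartitionedNames {
    private final Map<Integer, List<String>> byPartition = new HashMap<>();

    // Register a name under its partition (for example a country id).
    void add(int partitionId, String name) {
        byPartition.computeIfAbsent(partitionId, k -> new ArrayList<>()).add(name);
    }

    // Only these candidates would need the expensive fuzzy comparison.
    List<String> candidates(int partitionId) {
        List<String> names = byPartition.getOrDefault(partitionId, Collections.emptyList());
        return Collections.unmodifiableList(names);
    }
}
```

With 2 million names spread evenly over 400 partitions, `candidates` would return about 5000 names per lookup. Whether a Lucene filter yields a comparable saving is a separate question that this sketch leaves open.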




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org

Lucene search optimization
user name
2006-05-30 15:22:30
Sami,

You're on the right track in seeking something other than FuzzyQuery.
FuzzyQuery is rarely useful in general, and there are other ways to
achieve the same sort of thing (soundex, metaphone) in an efficient
manner.

If you could share some details about these properties and how you need
to query them, I'm sure the community could offer suggestions on an
efficient and clean implementation. Without details, it's not possible to
(easily) know how to recommend a specific technique.

	Erik
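
For reference, a minimal sketch of the classic American Soundex code mentioned in this reply, written in plain Java. It assumes unaccented Latin letters (any other character is skipped), and Soundex was designed around English surnames, so how well it suits city names is an open question. `SoundexSketch` is a hypothetical name, not a Lucene class.

```java
// Hypothetical illustration of American Soundex: first letter plus three digits.
final class SoundexSketch {
    // Digit for each letter a..z; '0' marks vowels and other uncoded letters.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < name.length() && out.length() < 4; i++) {
            char c = Character.toLowerCase(name.charAt(i));
            if (c < 'a' || c > 'z') continue;          // skip non-letters
            char code = CODES.charAt(c - 'a');
            if (out.length() == 0) {
                out.append(Character.toUpperCase(c));  // keep the first letter
                lastCode = code;
            } else if (c == 'h' || c == 'w') {
                continue;                              // h and w do not separate equal codes
            } else {
                if (code != '0' && code != lastCode) out.append(code);
                lastCode = code;                       // vowels reset the run
            }
        }
        while (out.length() < 4) out.append('0');      // pad short codes
        return out.toString();
    }
}
```

One way to use such a code is to index it in a separate field and match on it exactly, so that no edit distances need to be computed at query time.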



Lucene search optimization
user name
2006-05-30 15:55:48
Take a look at "FuzzyLikeThisQuery" in
contrib\queries.

I use it for name searches on large indexes. 
Unlike FuzzyQuery it:
a) limits the number of query terms produced
b) provides better ranking (disables idf factor which
otherwise boosts rare misspellings)

The cost of running a query is strongly related to the
quantity of terms in the query.
FuzzyQuery only limits the number of terms by quality
(which means you can unexpectedly produce a large
quantity of terms and therefore have a slow query).
FuzzyLikeThis is more explicit - it limits the
*quantity* of terms used (and automatically shortlists
to the best quality terms using the same edit-distance
metric as FuzzyQuery for ranking quality). 


Cheers,
Mark
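
The difference between limiting expansion terms by quality and limiting them by quantity can be sketched in plain Java. The similarity formula used here (one minus the edit distance divided by the longer length) and the names `FuzzyShortlistSketch` and `shortlist` are assumptions for illustration. This is not the code of FuzzyQuery or FuzzyLikeThisQuery.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: filter by similarity, then cap the term count.
final class FuzzyShortlistSketch {

    // Classic Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    // Quality limit first (minSimilarity), then quantity limit (maxTerms best).
    static List<String> shortlist(String query, List<String> terms,
                                  double minSimilarity, int maxTerms) {
        List<String> kept = new ArrayList<>();
        for (String t : terms) {
            if (similarity(query, t) >= minSimilarity) kept.add(t);
        }
        // Best matches first; the sort is stable, so ties keep their input order.
        kept.sort((x, y) -> Double.compare(similarity(query, y), similarity(query, x)));
        return kept.size() > maxTerms ? new ArrayList<>(kept.subList(0, maxTerms)) : kept;
    }
}
```

With only the quality limit, the number of surviving terms depends on what happens to be in the index. The `maxTerms` cap bounds it regardless.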



	
	
		

Lucene search optimization
user name
2006-05-30 19:16:47
Hi,

I didn't want to bother you with the exact details of my documents, but
since you're asking...

I have the list of all world cities, and would like to let users search
for their city while allowing them to make small mistakes. Additionally,
cities sometimes have different names, spellings, etc. (for example, a
small city near mine called Le Perray en Yvelines is sometimes spelled
Le-Perray-en-Yvelines, Le Perray en Ynes, Le Perray, etc.).

The way I was thinking of limiting the number of returned documents was
to specify the country, which would then divide the search space, but if
you can think of something better, I am open to any suggestion.

Soundex and metaphone are language-specific, right? Would they work for
cities?

The cities are available as XML from http://www.sirika.com/data/xmlgz/

If you need more information, just ask.
Regards,
Sami Dalouche
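
Spelling variants that differ only in hyphens, case, or accents can be folded together before any fuzzy matching is needed. A minimal sketch in plain Java follows. `CityNameNormalizer` is a hypothetical helper, and its folding rules are assumptions that only suit Latin-script names (characters outside a-z and 0-9 are treated as separators).

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration: fold case, accents and punctuation in city names.
final class CityNameNormalizer {
    static String normalize(String name) {
        // Split accented letters into base letter plus combining mark.
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Drop the combining marks, keeping the base letters.
        String noMarks = decomposed.replaceAll("\\p{M}+", "");
        String lower = noMarks.toLowerCase(Locale.ROOT);
        // Treat hyphens, apostrophes and any other non-alphanumerics as one space.
        return lower.replaceAll("[^a-z0-9]+", " ").trim();
    }
}
```

Applied at both index time and query time, this maps "Le-Perray-en-Yvelines" and "Le Perray en Yvelines" to the same string. Genuine misspellings and shortened names still need another technique.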


On Tuesday, 30 May 2006 at 11:22 -0400, Erik Hatcher wrote:
> Sami,
>
> You're on the right track in seeking something other than FuzzyQuery.
> FuzzyQuery is rarely useful in general, and there are other ways to
> achieve the same sort of thing (soundex, metaphone) in an efficient
> manner.
>
> If you could share some details about these properties and how you need
> to query them, I'm sure the community could offer suggestions on an
> efficient and clean implementation. Without details, it's not possible to
> (easily) know how to recommend a specific technique.
>
> 	Erik

Lucene search optimization
user name
2006-05-30 20:58:10
: Fuzzy searching against this property takes around 3 seconds, which is
: way too much for what I plan to do, so I am considering the possible

whenever anyone has a question about how to speed up a search, and the
current amount of time the search takes is more than a second, there are
a few questions i always want to ask:

 1) what method exactly on the Searcher interface are you using to
    execute the search?
 2) what exactly are you timing? (the time the search method call takes?
    the time it takes you to iterate over the results? etc...)
 3) are you sorting by any particular field?
 4) are you reusing the Searcher instance for more than one query? are
    you timing more than one query and taking the average?

-Hoss
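
Questions 2 and 4 above can be turned into a small timing harness. This is a sketch in plain Java: `SearchTimerSketch` and `averageMillis` are hypothetical names, and the `Runnable` stands in for whatever search call is being measured, ideally made on a searcher that is opened once and reused.

```java
// Hypothetical illustration: average the cost of a repeated search call.
final class SearchTimerSketch {
    // Runs the task untimed `warmups` times, then returns the mean
    // wall-clock time in milliseconds over `runs` timed executions.
    static double averageMillis(Runnable search, int warmups, int runs) {
        if (runs <= 0) throw new IllegalArgumentException("runs must be positive");
        for (int i = 0; i < warmups; i++) search.run();   // let caches and the JIT settle
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) search.run();
        return (System.nanoTime() - start) / 1e6 / runs;
    }
}
```

Timing the search call and the iteration over results as two separate tasks shows which of the two dominates.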



Lucene search optimization
user name
2006-05-31 07:44:54
Hi,

1) Actually, I am not using Lucene directly, but a wrapper called
Compass. I am using the find() method of CompassSession, whose code is:

    public CompassHits find(String query) throws CompassException {
        return createQueryBuilder().queryString(query).toQuery().hits();
    }

All of these objects are pure wrappers around their Lucene equivalents,
nothing more.

2) What I am timing is only the find call:

    -- start timer
    CompassHits hits = compassSession.find("cityName:" + name + "~");
    -- stop timer

3) I am not sorting anything, but Lucene is returning the hits by
relevance. Does this count as sorting?

4) I timed ~10 queries, and the results are roughly the same. It can go
down to 2 seconds, which is still way too much...

Thanks for helping
sami Dalouche

On Tue, 2006-05-30 at 13:58 -0700, Chris Hostetter wrote:
> : Fuzzy searching against this property takes around 3 seconds, which is
> : way too much for what I plan to do, so I am considering the possible
>
> whenever anyone has a question about how to speed up a search, and the
> current amount of time the search takes is more than a second, there are
> a few questions i always want to ask:
>
>  1) what method exactly on the Searcher interface are you using to
>     execute the search?
>  2) what exactly are you timing? (the time the search method call takes?
>     the time it takes you to iterate over the results? etc...)
>  3) are you sorting by any particular field?
>  4) are you reusing the Searcher instance for more than one query? are
>     you timing more than one query and taking the average?
>
> -Hoss

Lucene search optimization
user name
2006-05-31 08:28:59
>> Actually, I am not using Lucene directly, but a wrapper called compass


I don't know what controls it offers you then.
One option which could offer a speed up is to raise
the minimum quality match threshold above the default
of 0.5 and use a query string like this:

  cityName:bond~0.8 

This would reduce the number of alternative terms
considered and therefore the query time.
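
The effect of raising the threshold can be shown with a little arithmetic. The sketch below assumes the similarity is computed as one minus the number of edits divided by the term length. That formula is a simplification chosen for illustration and may differ from what FuzzyQuery does internally. `FuzzyThresholdSketch` and `maxEdits` are hypothetical names.

```java
// Hypothetical illustration: how many edits a similarity threshold tolerates.
final class FuzzyThresholdSketch {
    // Largest edit count that still satisfies
    // 1 - edits / termLength >= minSimilarity (the assumed formula).
    static int maxEdits(int termLength, double minSimilarity) {
        // The small epsilon guards against floating-point error at exact boundaries.
        return (int) Math.floor((1.0 - minSimilarity) * termLength + 1e-9);
    }
}
```

Under that assumption, a 4-letter term such as "bond" tolerates 2 edits at 0.5 and none at 0.8, while an 11-letter term tolerates 5 edits at 0.5 and 2 at 0.8. Fewer tolerated edits means fewer index terms qualify as alternatives.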



Lucene search optimization
user name
2006-05-31 17:30:49
: public CompassHits find(String query) throws CompassException {
:     return createQueryBuilder().queryString(query).toQuery().hits();
: }
: And all of these objects are pure wrappers around lucene equivalents,
: nothing more.

: 2) What I am timing is only the find call :
: -- start timer
: CompassHits hits = compassSession.find("cityName:"+ name+"~");
: -- stop timer

ok, but a thin wrapper around *which* lucene equivalents? ... there are
many different methods for doing a search in lucene, each with a
different usage pattern and performance characteristics ... if for example
that code uses a HitCollector and just pulls back the IDs into the
CompassHits, that's going to be faster than if it gets a Hits object and
then iterates over each Hit storing the full Document in the CompassHits
object -- especially if you've got more than 50 or so results ... in which
case using a Hits object will actually result in your search being
executed again and again as you iterate further down the list of results.

exactly what those methods do can make a big difference.

then again: maybe they don't, maybe fuzzy queries really are that slow (i
don't know, i've never used them). I just want to make sure you think
about those issues.



-Hoss
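
The re-execution cost described above can be illustrated with a toy model in plain Java. It is not Lucene's Hits class: the class `RerunningResultsSketch`, the initial window size, and the policy of doubling the window on each re-run are all assumptions made for the sake of the example.

```java
import java.util.Arrays;

// Toy model: results are cached in a window, and reading past the window
// re-runs the search with a window twice as large.
final class RerunningResultsSketch {
    private final int[] allMatches;     // stands in for the full result set
    private int[] cached = new int[0];  // results fetched by the latest run
    private int window;
    private int executions = 0;

    RerunningResultsSketch(int[] allMatches, int initialWindow) {
        if (initialWindow <= 0) throw new IllegalArgumentException("window must be positive");
        this.allMatches = allMatches.clone();
        runSearch(initialWindow);
    }

    // Simulates one full execution of the search, keeping the top newWindow hits.
    private void runSearch(int newWindow) {
        executions++;
        window = newWindow;
        cached = Arrays.copyOf(allMatches, Math.min(newWindow, allMatches.length));
    }

    // Returns hit i, re-running the search whenever i lies beyond the cache.
    int get(int i) {
        if (i < 0 || i >= allMatches.length) throw new IndexOutOfBoundsException("no hit " + i);
        while (i >= cached.length) runSearch(window * 2);
        return cached[i];
    }

    int size() { return allMatches.length; }
    int executions() { return executions; }
}
```

In this model, reading only the first few hits costs a single execution, while walking through every hit of a large result set triggers several. Collecting ids through a callback in one pass avoids the repeats.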



Lucene search optimization
user name
2006-05-31 19:23:57
Hi,

Thanks for the tip. However, my slowness does not seem to be caused by
the number of search results returned, since cityName:XX~0.8 took 2
seconds to return 2 results...

So the problem seems to be more related to scanning the index...

Thanks,
Sami Dalouche


Lucene search optimization
user name
2006-05-31 19:21:15
Hi,

Compass offers me every kind of control that Lucene does. It gives access
to the low-level Lucene API if you want it, so if you have a nice way of
optimizing this, I can have Compass adapt to that.


I tried cityName:city~0.8, and it is still not fast enough: something
around 2 seconds to return only 2 results (city:Rambouillet~0.8).

Sami Dalouche

On Wednesday, 31 May 2006 at 09:28 +0100, mark harwood wrote:
> >> Actually, I am not using Lucene directly, but a wrapper called compass
>
> I don't know what controls it offers you then.
> One option which could offer a speed up is to raise
> the minimum quality match threshold above the default
> of 0.5 and use a query string like this:
>
>   cityName:bond~0.8
>
> This would reduce the number of alternative terms
> considered and therefore the query time.
