Thread: optimizing single document searches




optimizing single document searches
Ruslan Sivak
United States
2007-02-27 17:25:15
I am using Lucene in a slightly unusual way: instead of searching all
the documents for a specific query, I am searching a single document
for many specific queries.

On a single document of 10k characters, doing about 40k searches takes
about 5 seconds. This is not bad, but I was wondering if I can somehow
speed this up. It also takes about 5 seconds to generate the search
terms (which is fine, since I will do it once and cache the result).

I'm not sure what information would be needed, but my queries look
something like this:

"Brooklyn NY"

I am currently using SpanNearQuery with a slop of 0 and inOrder set to
false. Is there perhaps another type of Query I can use to speed things
up? TermQuery doesn't work since I have multiple terms, and PhraseQuery
seems to take around the same time and is not compatible with
SpanNearQuery (I later merge this query with another in a
SpanNearQuery).

I can live without merging this into the SpanNearQuery, as long as I
can find something that can do the 40k searches faster.

Russ

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: optimizing single document searches
karl wettin
Sweden
2007-02-27 17:37:55
On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:

> On a single document of 10k characters, doing about 40k searches
> takes about 5 seconds. This is not bad, but I was wondering if I
> can somehow speed this up.

Your corpus contains only one document? Try contrib/memory, an index
optimized for that scenario.

-- 
karl
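
To make the idea concrete, here is a minimal sketch in plain Java of why
a single-document index is cheap to query: the document is tokenized once
into a map from term to positions, and each query is answered from that
map. This is not the contrib/memory implementation (in Lucene that role
is played by the MemoryIndex class). SingleDocIndex, tokenize and
matchesUnordered are hypothetical names, the tokenizer is a crude
stand-in for a real Analyzer, and matchesUnordered only approximates an
unordered SpanNearQuery with a slop of 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy positional index over one document (hypothetical; not Lucene's MemoryIndex). */
class SingleDocIndex {
    private final String[] tokens;
    // term -> ascending token positions at which the term occurs
    private final Map<String, List<Integer>> positions = new HashMap<>();

    SingleDocIndex(String text) {
        tokens = tokenize(text);
        for (int i = 0; i < tokens.length; i++) {
            positions.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
        }
    }

    /** Lowercases and splits on non-alphanumerics; a crude substitute for an analyzer. */
    static String[] tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                .filter(t -> !t.isEmpty())
                .toArray(String[]::new);
    }

    /** True if the phrase terms fill one window of adjacent positions, in any order. */
    boolean matchesUnordered(String phrase) {
        String[] terms = tokenize(phrase);
        if (terms.length == 0) {
            return false;
        }
        for (String t : terms) {
            if (!positions.containsKey(t)) {
                return false; // a term missing from the document can never match
            }
        }
        String[] wanted = terms.clone();
        Arrays.sort(wanted);
        int n = terms.length;
        // Only windows that contain an occurrence of the first term can match.
        for (int p : positions.get(terms[0])) {
            for (int start = Math.max(0, p - n + 1);
                    start <= p && start + n <= tokens.length; start++) {
                String[] window = Arrays.copyOfRange(tokens, start, start + n);
                Arrays.sort(window);
                if (Arrays.equals(window, wanted)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For example, new SingleDocIndex(text).matchesUnordered("Brooklyn NY")
returns true when "brooklyn" and "ny" occupy adjacent token positions in
either order. The document is tokenized only once, no matter how many
queries follow.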


Re: optimizing single document searches
Erick Erickson
2007-02-27 17:49:45
Which is very, very cool. I wound up using it for hit counting and it
works like a charm....

On 2/27/07, karl wettin wrote:
> Your corpus contains only one document? Try contrib/memory, an index
> optimized for that scenario.


Re: optimizing single document searches
Russ
Canada
2007-02-27 17:49:06
Thanks, I will try it tomorrow... Is it significantly different from
using a standard index on a RAMDirectory?

Russ


Re: optimizing single document searches
karl wettin
Sweden
2007-02-27 18:09:20
On 28 Feb 2007, at 00:49, Russ wrote:

> Thanks, I will try it tomorrow... Is it significantly different
> from using a standard index on a RAMDirectory?

A bit different.

You can also try LUCENE-550. It has about the same speed as
contrib/memory but can handle multiple documents and use a reader,
writer and searcher like any other index.

-- 
karl


Re: optimizing single document searches
Russ
Canada
2007-02-27 18:01:54
I will definitely check it out tomorrow.

I also forgot to mention that I am not interested in the hits
themselves, only whether or not there was a hit. Is there something I
can use that's optimized for this scenario, or should I look into
rewriting the search method of the IndexSearcher? Currently I just
check hits.size().

Russ

Re: optimizing single document searches
Paul Elschot
Netherlands
2007-02-28 13:41:08
On Wednesday 28 February 2007 01:01, Russ wrote:
> I will definitely check it out tomorrow.
>
> I also forgot to mention that I am not interested in the hits
> themselves, only whether or not there was a hit. Is there something I
> can use that's optimized for this scenario, or should I look into
> rewriting the search method of the IndexSearcher? Currently I just
> check hits.size().

For a single document: get the Scorer from the Query via its Weight.
Then check the return value of Scorer.next(); it indicates whether
the only doc matches the query.

Regards,
Paul Elschot.
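
As a rough illustration of this idea (not Lucene code): a scorer behaves
like a forward-only cursor over the matching documents, so one call to
next() is enough to answer a yes/no question and no hit list is ever
built. MatchCursor, TermCursor and HitCheck below are hypothetical
stand-ins written in plain Java; only the shape of the next() call
follows the description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a scorer: a forward-only cursor over matching documents. */
interface MatchCursor {
    /** Advances to the next matching document; returns false once none are left. */
    boolean next();
}

/** Lazily scans a list of documents for those that contain a given term. */
class TermCursor implements MatchCursor {
    private final List<String> docs;
    private final String term;
    private int current = -1;
    int docsExamined = 0; // bookkeeping only, to show how much work was done

    TermCursor(List<String> docs, String term) {
        this.docs = docs;
        this.term = term.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean next() {
        while (++current < docs.size()) {
            docsExamined++;
            String[] words = docs.get(current).toLowerCase(Locale.ROOT).split("\\W+");
            if (Arrays.asList(words).contains(term)) {
                return true;
            }
        }
        return false;
    }
}

class HitCheck {
    /** Answers "is there any hit?" with a single step; no hits are collected. */
    static boolean hasAnyMatch(MatchCursor cursor) {
        return cursor.next();
    }
}
```

With a one-document index the cursor has at most one position to visit,
so the check costs a single next() call per query.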



Re: optimizing single document searches
Ruslan Sivak
United States
2007-02-28 15:28:55
karl wettin wrote:
>
> On 28 Feb 2007, at 00:49, Russ wrote:
>
>> Thanks, I will try it tomorrow... Is it significantly different from
>> using a standard index on a RAMDirectory?
>
> A bit different.
>
> You can also try LUCENE-550. It has about the same speed as
> contrib/memory but can handle multiple documents and use a reader,
> writer and searcher like any other index.
>
> -- karl
>
Karl,

Thank you. I tried contrib/memory and it's awesome. Got my search
time down to 300ms from 5 seconds.

I'm still having some performance issues with the setup. I can probably
live with them, as I'll be caching these terms, but maybe I can optimize
it somehow. It currently takes about 3.5 seconds to set up. I am
basically creating 40k SpanNearQueries. Here is my method that creates
them. Is there anything I can improve?

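import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

private static Analyzer analyzer=new StandardAnalyzer();

// Builds a SpanNearQuery from (at most) the first 10 tokens of the string.
public static SpanNearQuery createSpanNearQuery(String string, int slop,
        boolean inOrder)
    {
        List<SpanTermQuery> terms=new ArrayList<SpanTermQuery>();
        // Lucene is the enclosing class that holds the shared analyzer.
        TokenStream tokenizer=Lucene.analyzer.tokenStream("body",
                new StringReader(string));
        try {
            Token token;
            // Stop at end of stream or after 10 terms, whichever comes first.
            while (terms.size()<10 && (token=tokenizer.next())!=null)
            {
                terms.add(new SpanTermQuery(
                        new Term("body",token.termText())));
            }
        } catch (IOException e) {
            // Fail loudly rather than build a query from a partial token list.
            throw new RuntimeException(e);
        }
        SpanTermQuery[] termsArray=
                terms.toArray(new SpanTermQuery[terms.size()]);
        return new SpanNearQuery(termsArray,slop,inOrder);
    }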


Russ
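
Since the plan is to build the queries once and cache them, here is a
minimal plain-Java sketch of that caching step, assuming the query
objects are safe to reuse across searches. QueryCache is a hypothetical
helper, not part of Lucene; it is generic in the query type so that the
sketch compiles without Lucene on the classpath.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Memoizes an expensive builder so each distinct phrase is converted only once. */
class QueryCache<Q> {
    private final Map<String, Q> cache = new ConcurrentHashMap<>();
    private final Function<String, Q> builder;

    QueryCache(Function<String, Q> builder) {
        this.builder = builder;
    }

    /** Returns the cached query for the phrase, building it on first use. */
    Q get(String phrase) {
        return cache.computeIfAbsent(phrase, builder);
    }

    /** Number of distinct phrases built so far. */
    int size() {
        return cache.size();
    }
}
```

In this thread's setting the builder would presumably be something like
s -> createSpanNearQuery(s, 0, false), so each of the 40k phrases is
analyzed at most once for the lifetime of the cache.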
