Thread: Tips on building a better BooleanQuery




Tips on building a better BooleanQuery
user name
2006-04-28 21:02:10
Hi!

I'm planning on contributing to Lucene by adding a new kind of query.
I don't know what to call it yet, but it would be a mix of BooleanQuery
and ExactPhraseQuery.

I would like to have a Query that is a BooleanQuery, but with a slight
twist: it would boost results in which the query terms occur as an
exact phrase.

For example, if I have terms A, B and C and I do a simple boolean
search: A B C, I would like to have a query that behaves a bit as if I
had rewritten this query as:

+A +B +C "A B" "B C" "A B C"

This would boost results where the exact string "A B C" or any
substring like "A B" or "B C" is found.
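
For illustration, the rewrite described above could be generated mechanically. The following is only a minimal sketch in plain Java: the class and method names (PhraseBoostRewriter, rewrite) are hypothetical, and it only builds the query string; it does not call any Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: builds the expanded query string
// for a list of terms, e.g. [A, B, C] -> +A +B +C "A B" "B C" "A B C".
class PhraseBoostRewriter {

    static String rewrite(List<String> terms) {
        List<String> clauses = new ArrayList<String>();

        // Every term is required, as in the plain boolean search.
        for (String term : terms) {
            clauses.add("+" + term);
        }

        // Add an optional phrase clause for every contiguous run of two or
        // more terms, shortest runs first.
        for (int len = 2; len <= terms.size(); len++) {
            for (int start = 0; start + len <= terms.size(); start++) {
                String phrase = String.join(" ", terms.subList(start, start + len));
                clauses.add("\"" + phrase + "\"");
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(rewrite(Arrays.asList("A", "B", "C")));
    }
}
```

Note that the number of phrase clauses grows quadratically with the number of terms (n*(n-1)/2 for n terms), which is one reason this expansion gets expensive.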

Of course I could rewrite all the queries this way, but searching with
the rewritten queries takes far too long. I wanted to know if anyone
has ideas about what direction I should take, or whether it would be
easy to implement this idea by modifying or extending some of the
existing Query classes.

I'm fairly new to Lucene, although I know a bit about search engines,
idf, etc. I've tried to understand BooleanQuery and ExactPhraseQuery to
see how I could modify them, but I'm having some trouble understanding
it all on my own.

Any help or comments would be appreciated. If it works well, I do think
it would be a good addition to the Lucene code base (I think this query
should be used as the default in the QueryParser instead of a simple
BooleanQuery).

Thanks in advance for your help,
Daniel Shane
Tips on building a better BooleanQuery
user name
2006-04-28 21:58:32
Daniel Shane wrote:
> For example, if I have terms A, B and C and I do a simple boolean
> search : A B C, I would like to have a query that behaves a bit like
> if I rewrote this query as such :
>
> +A +B +C "A B" "B C" "A B C"
>
> This would boost results where the exact string "A B C" or any
> substring like "A B" or "B C" are found.

You could extend ExactPhraseScorer and override phraseFreq() to first
compute a (weighted) sum of the freq() of each TermPositions in the
PhrasePositions.  The weighting should probably be something like
sqrt(freq)/idf(term).

Doug
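
The weighting suggested above can be sketched in isolation. This is only a minimal illustration in plain Java, not Lucene's scorer code: the class and method names (WeightedTermFreq, weightedSum) are hypothetical, and the per-term frequencies and idf values are assumed to be supplied by the caller rather than read from TermPositions.

```java
// Hypothetical helper, not part of Lucene: computes the weighted sum
//   sum over terms of sqrt(freq(term)) / idf(term)
// which an overridden phraseFreq() could use as its starting value.
class WeightedTermFreq {

    static double weightedSum(int[] termFreqs, double[] termIdfs) {
        if (termFreqs.length != termIdfs.length) {
            throw new IllegalArgumentException("one idf value is needed per term");
        }
        double sum = 0.0;
        for (int i = 0; i < termFreqs.length; i++) {
            // sqrt dampens high in-document frequencies; dividing by idf
            // follows the weighting proposed in the message above.
            sum += Math.sqrt(termFreqs[i]) / termIdfs[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two terms with frequencies 4 and 9 and idf values 2.0 and 3.0.
        System.out.println(weightedSum(new int[] {4, 9}, new double[] {2.0, 3.0}));
    }
}
```

How this sum is combined with the exact-phrase frequency is left open here, since the message does not specify it.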

