Thread: Boolean searching across multiple fields




2006-10-12 00:32:55
On Oct 11, 2006, at 4:05 PM, Chris Nandor wrote:

>>> I know in the next version I can do, simply:
>>>
>>>     my $query_parser = KinoSearch::QueryParser::QueryParser->new(
>>>         analyzer => $analyzer,
>>>         fields   => \@fields,
>>>     );
>
> So this code will allow the above behavior, then?

Yes.  QueryParser behaves like this because it's the most intuitive
behavior for the common case.

Most often, people want to search multiple fields -- say, title and
body.  A required term such as "+senator" must match against AT LEAST
ONE field out of several.  A prohibited term such as "-senator" MUST
NOT MATCH AGAINST ANY of them.  It's as if all the fields were
flattened into one and QueryParser was generating a query against
that.  However, the scoring algorithm still gets to use multiple
fields, which is important for returning the most relevant document
set.
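
To make the required/prohibited rule concrete, here is a minimal
pure-Perl sketch.  It is not KinoSearch code: matches_any_field() and
passes() are hypothetical helpers, and the "index" is just a hash of
field name to text, matched by whole word.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch: counts the fields of $doc
# in which $term appears as a whole word (case-insensitive).
sub matches_any_field {
    my ( $doc, $term ) = @_;
    my $count = grep { /\b\Q$term\E\b/i } values %$doc;
    return $count;
}

# A document passes if every required term matches AT LEAST ONE field
# and no prohibited term matches ANY field.
sub passes {
    my ( $doc, $required, $prohibited ) = @_;
    for my $term (@$required) {
        return 0 unless matches_any_field( $doc, $term );
    }
    for my $term (@$prohibited) {
        return 0 if matches_any_field( $doc, $term );
    }
    return 1;
}

my %doc = (
    title => 'Senator visits factory',
    body  => 'The visit lasted two hours.',
);

# '+senator' is satisfied by the title alone.
print passes( \%doc, ['senator'], [] ) ? "match\n" : "no match\n";

# '+visit -senator' fails: 'senator' appears in one field, which is enough
# to exclude the document.
print passes( \%doc, ['visit'], ['senator'] ) ? "match\n" : "no match\n";
```

This only models which documents match.  It says nothing about
scoring, which is where keeping the fields separate pays off.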

The guts that make that happen are kind of complicated (thank dog for
tests!) but the concept is straightforward: QueryParser processes the
input string one chunk at a time.

Consider the following input:

     '+foo -bar "okee dokee"'

First chunk is '+foo'.  It gets expanded to...

     '+(title:foo OR body:foo)'

Next, '-bar' expands to...

     '-(title:bar OR body:bar)'

Lastly, the phrase '"okee dokee"' gets treated as a single chunk,
expanding to...

     '(title:"okee dokee" OR body:"okee dokee")'

(Note that the internal mechanism isn't literal text expansion --
QueryParser is using Query objects.)
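
If it helps to see the chunking spelled out, the sketch below mimics
the expansion as plain text.  It is an illustration only:
expand_query() is a hypothetical function, the chunking regex is my
own simplification, and the real QueryParser does not work on strings
like this.

```perl
use strict;
use warnings;

# Expand a query string chunk by chunk into a per-field textual form.
# Hypothetical illustration; QueryParser itself builds Query objects.
sub expand_query {
    my ( $query_string, @fields ) = @_;
    my @expanded;

    # A chunk is an optional +/- prefix followed by either a quoted
    # phrase or a run of non-whitespace characters.
    while ( $query_string =~ /([+-]?)("[^"]*"|\S+)/g ) {
        my ( $prefix, $text ) = ( $1, $2 );
        my $group = join ' OR ', map { $_ . ':' . $text } @fields;
        push @expanded, $prefix . '(' . $group . ')';
    }
    return @expanded;
}

print "$_\n" for expand_query( '+foo -bar "okee dokee"', qw( title body ) );
# +(title:foo OR body:foo)
# -(title:bar OR body:bar)
# (title:"okee dokee" OR body:"okee dokee")
```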

> Curiously, how would I do it in 0.12?  Knowing that may help me
> understand the whole thing better.

That particular configuration is actually kind of hard to nail with
0.12.  The "negate operator bug" that was fixed in 0.13 actually
affected queries in which all clauses are required too, of which your
'+foo +bar' is the perfect reduced example.

QueryParser's clever trick is to handle the string chunk by chunk.
There's no public API for squeezing chunks out of QueryParser
one-at-a-time, though, so you can't duplicate the multi-field
functionality easily.

As a workaround, you can dump all content into one big field.

    $doc->set_value( title       => $title );
    $doc->set_value( body        => $body );
    $doc->set_value( all_content => "$title $body" );

Then, you create a QueryParser against the all_content field, and
your search for '+foo +bar' returns the correct set of documents.

     my $query_parser = KinoSearch::QueryParser::QueryParser->new(
         default_field => 'all_content',
     );
     my $query = $query_parser->parse('+foo +bar');

Essentially, you are flattening the fields yourself, rather than
letting the QueryParser from KinoSearch 0.13 do it for you.

This option gets recommended all the time on the Lucene user's list,
and it's OK for small document sets.  However, the relevancy from
that search will be inferior to a search performed against multiple
fields, because the title text gets dumped into all_content rather
than staying separate -- where, as a short field, it will
automatically be weighted more heavily.  With large document sets,
relevancy becomes a major concern, and I recommend against this
technique.
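
A rough way to see the size of the effect, assuming a Lucene-style
length normalization of 1 / sqrt(number of tokens in the field).  That
formula is an assumption here, as are the token counts; the exact
weighting in any given KinoSearch release may differ.

```perl
use strict;
use warnings;

# ASSUMPTION: Lucene-style length normalization, 1 / sqrt(token count).
sub length_norm {
    my ($num_tokens) = @_;
    return 0 if $num_tokens <= 0;
    return 1 / sqrt($num_tokens);
}

my $title_tokens = 4;      # hypothetical title length
my $body_tokens  = 396;    # hypothetical body length

printf "title alone:  %.3f\n", length_norm($title_tokens);
printf "body alone:   %.3f\n", length_norm($body_tokens);
printf "all_content:  %.3f\n", length_norm( $title_tokens + $body_tokens );
# title alone:  0.500
# body alone:   0.050
# all_content:  0.050
```

Under that assumption, a hit in the 4-token title carries ten times the
length weight of the same hit in the flattened 400-token field, which
is the advantage you give up by merging everything into all_content.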

Another option is to rewrite your requirements.  Make sure that
'foo' and 'bar' come to you already split up -- say from different
HTML form fields -- so you don't need to rely on QueryParser to break
up the string and determine what's required/prohibited.  Then, you
can build up your own compound BooleanQuery piece by piece.
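
Here is the shape of that approach as a pure-Perl sketch.  The nested
hashes are a hypothetical stand-in for the compound query, and
build_clauses() is not a KinoSearch function; with the real library
you would build the corresponding query objects rather than hashes.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a compound boolean query: one clause per
# term, each with an occurrence flag and the per-field alternatives.
sub build_clauses {
    my (%args) = @_;
    my @fields = @{ $args{fields} };
    my @clauses;
    for my $occur (qw( required prohibited )) {
        for my $term ( @{ $args{$occur} || [] } ) {
            push @clauses, {
                occur  => $occur,
                any_of => [ map { +{ field => $_, term => $term } } @fields ],
            };
        }
    }
    return \@clauses;
}

# Terms arrive already split up, e.g. from separate HTML form fields,
# so no query-string parsing is needed.
my $clauses = build_clauses(
    fields     => [qw( title body )],
    required   => ['foo'],
    prohibited => ['bar'],
);

for my $clause (@$clauses) {
    my $alts = join ' OR ',
        map { $_->{field} . ':' . $_->{term} } @{ $clause->{any_of} };
    print "$clause->{occur}: ($alts)\n";
}
# required: (title:foo OR body:foo)
# prohibited: (title:bar OR body:bar)
```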

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
