Thread: RE: Zend_Search_Lucene: Limit results and some other questions




user name
2008-01-10 11:58:33
Hi Ralf,

> Hi,
> 
> I would like to start using Zend_Search_Lucene for the website search
> engine of our travel community. We have a couple of different areas
> which need to be searchable: forum, articles, gallery, destinations, and
> maybe members. It should be possible to just search in one area, for
> example the forum or the destinations. But it should also be possible to
> search in all areas in one step.

The Lucene document model is very flexible. You can store (and
index) additional document attributes in special fields.
For example, a special 'area' field may contain one or more
terms that define the document's area.
You can then add an 'area' clause to limit searching:
    ($userQuery) AND area:gallery
Or do it through the API:
-------------------
$parsedQuery = Zend_Search_Lucene_Search_QueryParser::parse($userQuery);

$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($parsedQuery, true /* required */);

// Term text comes first, then the field name
$areaTerm  = new Zend_Search_Lucene_Index_Term('gallery', 'area');
$areaQuery = new Zend_Search_Lucene_Search_Query_Term($areaTerm);

$query->addSubquery($areaQuery, true /* required */);

$hits = $index->find($query);
--------------------
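
If you prefer the query-string form, a small wrapper can add the clause for you. This is only a sketch: buildAreaQuery() is a hypothetical helper, not part of Zend_Search_Lucene, and the list of allowed areas is an assumption taken from the question above.

```php
<?php
// Hypothetical helper (not part of Zend_Search_Lucene): wraps the user's
// query string so that it only matches documents from one area.
function buildAreaQuery($userQuery, $area = null)
{
    // Assumed list of areas, taken from the question above.
    $allowedAreas = array('forum', 'article', 'gallery', 'destination', 'member');

    if ($area === null) {
        return $userQuery; // no restriction: search all areas in one step
    }
    if (!in_array($area, $allowedAreas, true)) {
        throw new InvalidArgumentException("Unknown area: $area");
    }

    return '(' . $userQuery . ') AND area:' . $area;
}

// Possible usage, assuming $index is an opened index:
// $hits = $index->find(buildAreaQuery($userQuery, 'gallery'));
```

Checking the area against a fixed list keeps arbitrary request input out of the query string.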

See the documentation for details
(http://framework.zend.com/manual/en/zend.search.lucene.html).
The ZendCon presentation may also be helpful
(http://devzone.zend.com/content/zendcon_07_slides/Evron_Shahar_Indexing_With_Zend_Search_Lucene-ZendCon07.pdf).


> Another requirement is to limit the results based on the current
> selected destination. For example, if the user is searching in the
> Greece forum, the index should be searched for all documents from type
> forum and for all destinations in Greece. The destination data structure
> is kept as a binary tree in a MySQL database.

I think the best way is to store the whole branch in a special
document field.

For example, take a document with Athens as its destination. The
idea is to store the whole tree path, 'Europe Greece Athens':
-----------------
$doc->addField(Zend_Search_Lucene_Field::UnStored('destination', 'Europe Greece Athens'));
--------

That makes it possible to search documents (or limit search
results) efficiently at any level of the destination tree.

If some node name is not unique, you can specify the full path
with a phrase query:
    ($userQuery) AND destination:"Europe Greece"
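
Both sides of this (the field value at indexing time and the phrase clause at search time) can come from the same branch array. A minimal sketch, assuming you already have the branch as an array of node names read from your MySQL tree; the function names are hypothetical:

```php
<?php
// Hypothetical helpers; the names are illustrative only.

// Field value for a document: the whole branch from the root to the node,
// e.g. array('Europe', 'Greece', 'Athens') -> 'Europe Greece Athens'.
function destinationPath(array $branch)
{
    return implode(' ', $branch);
}

// Restrict a user query to a (possibly partial) branch with a phrase clause.
function buildDestinationQuery($userQuery, array $branch)
{
    return '(' . $userQuery . ') AND destination:"' . destinationPath($branch) . '"';
}

// Possible usage, assuming $index is an opened index:
// $hits = $index->find(buildDestinationQuery($userQuery, array('Europe', 'Greece')));
```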



Hm... Good candidate for Best Practice documentation
section.


Another way is to store only the final destination in the
destination field and construct an additional subquery on the
fly (using the MySQL data).

> The last important requirement is to limit the search results for
> pagination, so on page 1 I only want to show results 1 to 10, on page 2
> the results 11 to 20, and so on.

Lucene (and Zend_Search_Lucene) has to process the whole
result set to calculate hit scores and return hits in the
right order.
The returned hit objects contain only internal document IDs and
need no additional resources as long as you don't access
stored document fields (!). As soon as you do, the document
is automatically loaded from the index.

The result set limit
(Zend_Search_Lucene::setResultSetLimit($newLimit)) is
intended for really huge result sets (tens or hundreds of
thousands of hits). It returns the "first N hits" found instead
of the "best N hits" (the returned hits are still ordered by
score).
That's not the best behavior, but it may help in some cases.
It's not suitable for pagination.

The right way to implement pagination is to collect all
document IDs from the result set somewhere (without accessing
any stored field!) and retrieve documents from the index only
when necessary:
------------------------------
// 'automatic_serialization' must be enabled here to cache arrays
$frontendOptions = array(...);
$backendOptions = array('cache_dir' => './tmp/');
$cache = Zend_Cache::factory('Core', 'File',
                             $frontendOptions, $backendOptions);

if (!$result = $cache->load('myresult')) {
    // Cache miss: run the search and keep only IDs and scores
    $result = array();
    $scores = array();
    foreach ($index->find($query) as $hit) {
        $result[] = $hit->id;
        $scores[] = $hit->score;
    }

    $cache->save($result, 'myresult');
    $cache->save($scores, 'myscores');
} else {
    $scores = $cache->load('myscores');
}

// Output $docsPerPage documents starting from $startResultID
for ($resultID = $startResultID;
     isset($result[$resultID]) && $resultID < $startResultID + $docsPerPage;
     $resultID++) {
    // Only now is the document loaded from the index
    $doc = $index->getDocument($result[$resultID]);

    ...
    echo $doc->url;
    ...
    echo $scores[$resultID];
}
...
-------------------------
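
The snippet above uses fixed cache IDs, so in practice you would want one ID per query, plus a way to map a page number to a slice of the cached ID list. A small sketch of both; resultCacheId() and pageOfIds() are hypothetical helpers, not part of Zend Framework:

```php
<?php
// Hypothetical helpers for the pagination approach above.

// Derive a cache ID from the query string so that different searches do
// not overwrite each other's cached ID list. The result contains only
// letters, digits and '_', which is what Zend_Cache accepts as an ID.
function resultCacheId($queryString)
{
    return 'search_' . md5($queryString);
}

// Return the document IDs for one page ($page starts at 1).
function pageOfIds(array $docIds, $page, $docsPerPage = 10)
{
    $offset = (max(1, (int) $page) - 1) * $docsPerPage;

    return array_slice($docIds, $offset, $docsPerPage);
}

// Possible usage with the cached $result array from above:
// foreach (pageOfIds($result, 2) as $docId) {
//     $doc = $index->getDocument($docId);
// }
```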


> My first idea to solve this is to add two additional fields to each
> document:
> 
> a) field 'area' which can only have one of the values 'forum',
>    'article', 'gallery', 'destination' or 'member'

Yes, that's the right way to do this.
The only thing I would recommend is to watch performance
with low-selectivity fields.
For example, if you have a large enough document set (hundreds
of thousands of documents) and a field like 'sex', then the
engine has to construct a list containing ~1/2 of the index
documents and intersect it with the rest of the query result.

It may be more effective to retrieve the full result set and
then filter it by checking the field value.
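
A rough sketch of such a post-filter follows. filterHitsByField() is a hypothetical helper; it assumes the field is stored in the index (for example as a Keyword field) so it can be read from the hit. Keep in mind that reading a stored field loads the document, so this trades one large term list for one document load per hit.

```php
<?php
// Hypothetical helper: keep only the hits whose stored field has the
// wanted value. Works on any objects that expose the field as a property.
function filterHitsByField($hits, $field, $value)
{
    $kept = array();
    foreach ($hits as $hit) {
        if ($hit->$field === $value) {
            $kept[] = $hit;
        }
    }

    return $kept;
}

// Possible usage, assuming $index is an opened index:
// $hits = filterHitsByField($index->find($userQuery), 'sex', 'f');
```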

> b) field 'destination' which will be filled with a string that combines
>    the destination hierarchy of the destination primary keys, e.g.
>    'Athens' has key 322, 'Greece' has key 44 and 'South Europe' key 40,
>    so for an article about 'Athens' this field would be filled with the
>    value '40-44-322-'. If I want to search for all 'Greece' articles I
>    will search for '40-44-*'

Yes. That's also a way to do this.
You only need to
a) switch the default analyzer to one that indexes terms
with numbers:
---------------------------------
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
---------------
b) use some letter as the delimiter (otherwise the value will
be tokenized into several terms), or use your own analyzer
that treats '-' as part of a term.
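
A sketch of option b) with a letter delimiter. The function names and the delimiter 'x' are arbitrary choices for illustration, and the sketch assumes the TextNum analyzer keeps a run of letters and digits together as one term:

```php
<?php
// Hypothetical helpers; the delimiter letter 'x' is an arbitrary choice.

// array(40, 44, 322) -> '40x44x322x'
function destinationKeyPath(array $keys, $delimiter = 'x')
{
    return implode($delimiter, $keys) . $delimiter;
}

// Wildcard term that matches a node and everything below it,
// e.g. array(40, 44) -> '40x44x*'
function destinationKeyPrefix(array $keys, $delimiter = 'x')
{
    return destinationKeyPath($keys, $delimiter) . '*';
}
```

The trailing delimiter matters: without it, a search for '40x44*' would also match a sibling with key 445.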

 
> Besides these fields I want to add some further fields.
> 
> c) the document text to be indexed and searched
> d) the document url
> e) the page title
> f) the date of the document (last changed)
> 
> Finally, here are my questions:
> 
> 1) Does my approach for the limitation of the area and the destination
>    fields make sense or did I overlook something?

That's the right way.

> 2) I am not quite sure which field types I should use for the six
>    fields mentioned above. Any suggestions?

My suggestion:

the document text - UnStored
the document url  - UnIndexed (if you don't plan to search
by URL)
the page title    - Text
the date of the document - Keyword

But it may depend on other things.

> 3) Does it make sense to create one index for each area to improve
>    performance? If yes, I might forget about the all-area search
>    facility.

That depends.
Search time is generally composed of the following
operations:
1. Index opening.
2. Executing subqueries (your base query + the area
specification subquery).
3. Intersecting the subquery result sets.

1. Index opening actually means preloading the dictionary
index (usually every 128th dictionary term, which allows
binary search). It is performed at the first query execution.
A larger index doesn't always have a larger dictionary (if
the dictionary is full).
So opening a large index may take nearly the same time as
opening an area sub-index.

On the other hand, searching through several sub-indexes
multiplies the index opening time (by the number of
sub-indexes you use).

This may take significant time.
For example, for some simple queries index opening may take
0.040 sec and the search itself 0.002 sec.


2. I estimate base query execution time as a linear function
of index size. The question is whether it's comparable to the
index opening time.

The area specification subquery takes a fixed time that
depends on the size of the area.


3. Subquery result set intersection is efficient now, but
it also takes time.



Only tests with your production data will give the right
answers.


PS This question is related to index optimization issues.
Some tips can be found here:
http://framework.zend.com/manual/en/zend.search.lucene.index-creation.html#zend.search.lucene.index-creation.optimization

http://framework.zend.com/manual/en/zend.search.lucene.best-practice.html#zend.search.lucene.best-practice.indexing-performance
 

It's also planned to implement a multi-searcher to search
through several indices
(http://framework.zend.com/issues/browse/ZF-525). It may help
in the future with multi-index configurations.


> 4) It might be a slight overhead to use Zend_Search_Lucene to search
>    for a destination which basically only consists of the destination
>    name. So using a simple search directly in the MySQL database for
>    this area might be faster and would not need any indexing. What do
>    others think about this?

That depends on the index size, the nature of your data, the
size of the destination tree, destination cardinality (the
typical number of documents per destination) and so on.
Only tests with actual data will give the right answer.
 
> 5) The documentation shows a way to limit the total amount of results.
>    But I did not find a way how to set an offset to limit the results
>    for pagination. Do I really need to fetch all results and then handle
>    the pagination in my controller, which would mean that each request
>    will return all results?

See above.


With best regards,
   Alexander Veremyev.

> Thanks for your comments and help.
> 
> Best Regards,
> 
> Ralf
