On Jun 25, 2007, at 2:04 PM, Doug Cutting wrote:
> Doron Cohen wrote:
>> It is very important that we would be able to assess the search
>> quality in a repeatable manner - so that anyone can repeat the
>> quality tests, and maybe find ways to improve them. (This would also
>> allow to verify the "improvements claims" above...). This capability
>> seems like a natural part of the benchmark package. I started to
>> look at extending the benchmark package with search quality module,
>> that would open an index (or first create one), run a set of queries
>> (similar to the performance benchmark), and compute and report the
>> set of known statistics mentioned above and more. Such a module
>> depends on input data - documents, queries, and judgements. And
>> that's my second question. We don't have to invent this data - TREC
>> has it already, and it is getting wider every year as there are more
>> judgements. So, theoretically we could use TREC data.
>
> We should be careful not to tune things too much for any one
> application and/or dataset. Tools to perform evaluation would
> clearly be valuable. But changes that improve Lucene's results on
> TREC data may or may not be of general utility. The best way to
> tune an application is to sample its query stream and evaluate
> these against its documents.
>
+1. To do this, we could use Reuters or Wikipedia. The hard part is
generating the queries and having people make relevance judgments for
a sufficient sample size. This would get better over time, and we
might get more support from outsiders if we had a nice way for people
to add queries/judgments without going through the patch/commit
process. Maybe a page on the wiki could hold the queries and
judgments, though that could get tricky.

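
One way a shared set of judgments could be kept easy to contribute to is a
plain-text layout that anyone can edit. The sketch below assumes the
four-column layout TREC uses for its qrels files (query id, iteration,
document id, relevance). The class name JudgmentParser, the "#" comment
convention, and the choice to silently skip malformed lines are
illustrative assumptions, not anything in Lucene or the benchmark package.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: reads qrels-style judgment lines into memory. */
class JudgmentParser {

    /**
     * Parses lines of the form "queryId iteration docId relevance".
     * Returns, for each query id, the set of document ids judged relevant
     * (relevance greater than zero). Blank lines, lines starting with '#'
     * (an assumed comment convention), and malformed lines are skipped.
     */
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> relevant = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 4) {
                continue; // not a four-column judgment line
            }
            int relevance;
            try {
                relevance = Integer.parseInt(parts[3]);
            } catch (NumberFormatException e) {
                continue; // relevance column is not a number
            }
            if (relevance > 0) {
                relevant.computeIfAbsent(parts[0], k -> new HashSet<>()).add(parts[2]);
            }
        }
        return relevant;
    }
}
```

Lines copied from a wiki page could be fed straight into parse(), which
would keep contributions out of the patch/commit process while still
giving the evaluation code a fixed format to rely on.
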
> That said, Lucene's scoring method has never been systematically
> tuned, and some judicious tuning based on TREC results would
> probably benefit a majority of Lucene applications. Ideally we can
> develop evaluation tools, use them on a variety of datasets to find
> better defaults for Lucene, and make the tools available so that
> folks can fine-tune things for their particular applications.
>
+1 as well.
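
As a rough illustration of the kind of evaluation tool discussed in this
thread, the sketch below computes precision at k and (mean) average
precision from a ranked result list and a set of relevance judgments.
The thread does not say which statistics the module would report, so
these two standard measures are an assumption. The class QualityStats
and its method names are hypothetical and not part of Lucene or the
benchmark package. Ranked lists are plain document ids, so the sketch
is independent of how the search itself is run.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: basic search-quality statistics over ranked doc ids. */
class QualityStats {

    /** Fraction of the top k results that are judged relevant. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            return 0.0;
        }
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    /**
     * Average precision for one query: the precision at each rank where a
     * relevant document appears, summed and divided by the total number of
     * relevant documents. Assumes the ranked list has no duplicate ids.
     */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    /** Mean of averagePrecision over every query that has judgments. */
    static double meanAveragePrecision(Map<String, List<String>> runs,
                                       Map<String, Set<String>> judgments) {
        if (judgments.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (Map.Entry<String, Set<String>> e : judgments.entrySet()) {
            // A query with judgments but no results contributes zero.
            List<String> ranked = runs.getOrDefault(e.getKey(), List.of());
            total += averagePrecision(ranked, e.getValue());
        }
        return total / judgments.size();
    }
}
```

Because the inputs are just ids and judgments, the same code could be
pointed at TREC data, Reuters, Wikipedia, or a sample of an
application's own query stream, which fits the point above about not
tuning for any single dataset.
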