On 1/30/08, Marvin Humphrey <marvin rectangular.com> wrote:
> Query objects represent an abstract ideal. They don't
dirty their
> hands with actual real-world index data.
>
> Query: A pure, abstract representation of a
logical query.
> Weight: Applies a query to a particular collection
of documents.
> Scorer: Applies a query to individual documents.
>
> This is an area that's ripe for refactoring. I've
already pulled out
> a bunch of cruft that was inherited from Lucene. Why
don't we see
> what we come up with if we go back to first
principles? I think the
> division of labor in the Query-Weight-Scorer hierarchy
described
> above is sound. Do you agree?
I apologize for the delay on getting back to you on this.
This
strikes me as a very important discussion to have, and I'd
love to
have a wider discussion about it. When I was spending lots
of time
with KinoSearch, my poor understanding of how these pieces
interrelate
really hampered me. I'm sure some of this was due to my
incorrect
preconceptions, but I also think that clearer and simpler
architecture
here would help the project a lot.
Rather than discussing whether the view you offer is sound,
let me
start by offering my naive stick figure drawing of how I
picture
search working:
----
A Query is produced by a Parser: a user enters text, and
this text is
compiled into a Query. A Query can be composed of
sub-Queries, which
could be run as Queries in their own right. At the C
level, Queries
have a public data API, which is filled in by Index specific
Posting.
The tree structure of the Query is also public, so that
Scorers can
walk this tree and run sub-Scorers attached to sub-Queries.
When the Query is run against an Index, it is initialized
top-down.
The bottom level TermQueries are associated with Postings
specific to
the Index format being used, and Scorers associated with
each Query
are initialized. Then there is a loop, where each Query is
advanced
until it finds a matching document. The Queries propagate
the advance
downward, ratcheting forward until satisfied with its
children's
states.
When the topmost Query has found a matching document, a
Scorer
associated with the top Query is run. This Scorer returns a
numeric
score, but can generate this number however it chooses. A
simple
scorer (on a TermQuery) might simply return the weight it
was passed
at initialization. A more complex Scorer (for an OrQuery)
might run
the sub-Scorers on each of the children of its associated
Query and
return the result of its highest sub-Scoring child.
But while a Scorer might be a default for given type of
Query, it is
not intimately associated with it. A PhraseQuery might
choose to use
a simple WeightScorer and the sub-Scorers associated with
the Terms
would never be called. Certain Scorers might only make
sense with
queries that have children, but this would be enforced by
typechecks
at initialization.
The score is passed to a Collector along with the current
filled in
Query. The Collector keeps a record of the scores it wants
(top N,
top N with an offset, top N below a set score, all scores
above a
minimum, whatever) and discards the rest. If it wants, it
can cache
other information associated with the Query to be returned
with the
Results. After there are no more matching documents, some
'finalize'
method is called on the Collector, allowing it to tidy up
and return
its Results.
----
I'm sure there in contradictions in my description, and
probably some
things that would be impossible to implement exactly as
described.
But this is basically the architecture that makes sense to
me
(although it's true I'm thinking a lot more about ice cream
right now
than search). The current KinoSearch architecture --- even
though I
put a lot of time into spelunking it --- never really sunk
in. In
particular, I never figured out why Weights existed.
Which is to say that I don't think the current division of
labor is
that sound, and that I'd love to help you come up with a
basic
architecture that is simpler and more flexible.
Unfortunately, I'm
probably too far out of KinoSearch mode right now to be of
much use
other than as a sounding board. But I'll do all I can to
encourage
you to keep thinking about the problem from first
principles, since I
think you are bound to come up with something I like better
than the
current Lucene based architecture!
Nathan Kurz
nate verse.com
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|