|
List Info
Thread: opening up the scorers
|
|
| opening up the scorers |

|
2008-04-17 12:43:09 |
> I've been thinking about adding new public classes
ORQuery, ANDQuery,
> ANDNOTQuery and ANDORQuery. BooleanQuery would either
be deprecated or
> removed; the logic from the compilation phase of
BooleanScorer's first
> iteration would be moved to QueryParser.
This sounds like a good idea to me, especially changing
QueryParser to
build the query directly from the components. I think it
would be
great to have a toolbox of component scorers (core or KSx)
that can be
wired together in different ways by custom QueryParsers.
I'm not sure I understand the differences between the
component
classes you are proposing though. Or maybe I do, and I just
don't
understand the names. Also, related to this and probably
evident from
my sloppy terminology, I still can't keep straight how
Queries and
Scorers relate.
AndQuery: short circuit and, scored in some way as a product
of subqueries?
OrQuery: score equal to best scoring subquery, could be
short circuit if sorted?
AndOrQuery: score all subqueries and add them, possibly
normalized?
AndNotQuery: not sure why this isn't a NotQuery, scored as
a constant?
> > ps. Marvin --- the term-by-term approach might be
a useful general
> > optimization for a special purpose additive
OrScorer.
> >
>
> Yeah, term-at-a-time scoring is great stuff, it's just
that the combining
> scorers in KS all need to go doc-at-a-time in order to
handle boolean
> constraints without blowing up.
I agree that it probably can't be the default
OrQuery/OrScorer, but it
strikes me as a useful piece of rope to tempt users who are
creating
their own queries. It also might be useful to think about
how Queries
could be split across cores/servers. If it worked, there
would be
some performance benefits of doing so per term rather than
partitioning the corpus.
Nathan Kurz
nate verse.com
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: opening up the scorers |
  United States |
2008-04-17 17:10:25 |
On Apr 17, 2008, at 10:43 AM, Nathan Kurz wrote:
>> I've been thinking about adding new public classes
ORQuery, ANDQuery,
>> ANDNOTQuery and ANDORQuery. BooleanQuery would
either be
>> deprecated or
>> removed; the logic from the compilation phase of
BooleanScorer's
>> first
>> iteration would be moved to QueryParser.
>
> This sounds like a good idea to me, especially changing
QueryParser to
> build the query directly from the components.
Groovy.
> I think it would be
> great to have a toolbox of component scorers (core or
KSx) that can be
> wired together in different ways by custom
QueryParsers.
Core is going to need these combining Query/Scorer
subclasses.
People could potentially publish KSx subclasses that compile
down to
scorers that behave differently from those in core.
> I still can't keep straight how Queries and Scorers
relate.
Query is the abstract specification. It's little more than
a parse
tree for a search string[1][2].
Scorer does the hard work of actually scoring documents.
It's the
practical application of a Query, where Query meets the real
world.
Query objects are not tied to any given collection of
documents -- you
can apply a Query to different indexes, just as you can
search for
"foo AND bar" at either Google or Yahoo.
Scorers, on the other hand, operate against specific
indexes.
> AndQuery: short circuit and, scored in some way as a
product of
> subqueries?
> OrQuery: score equal to best scoring subquery, could be
short
> circuit if sorted?
> AndOrQuery: score all subqueries and add them, possibly
normalized?
> AndNotQuery: not sure why this isn't a NotQuery,
scored as a
> constant?
ANDQuery - Search for 'a AND b'.
ORQuery - Search for 'a OR b'.
ANDNOTQuery - Search for 'a AND NOT b'.
ANDORQuery is the odd one out, because it doesn't really
mean 'a AND/
OR b'. What it does is combine one optional clause and one
required
clause.
ANDORQuery - Search for 'a +b'
I chose those names because they seemed clearer than the
Lucene
equivalents. Here's the mapping of Scorer subclasses:
KS Lucene
=======================================================
ANDScorer ConjunctionScorer
ORScorer DisjunctionSumScorer
ANDNOTScorer ReqExclScorer
ANDORScorer ReqOptSumScorer
"ConjunctionScorer" I thought was a particularly
poor name.
Grammatically speaking both 'OR' and 'AND' are
"conjunctions", but the
"Conjunction" in "ConjunctionScorer"
doesn't refer to *that* kind of
conjunction -- which is really confusing.
> I agree that it probably can't be the default
OrQuery/OrScorer, but it
> strikes me as a useful piece of rope to tempt users who
are creating
> their own queries. It also might be useful to think
about how Queries
> could be split across cores/servers. If it worked,
there would be
> some performance benefits of doing so per term rather
than
> partitioning the corpus.
Sure, there are definitely performance benefits to
term-at-a-time - it
just doesn't scale well when you need to apply boolean
constraints.
Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/
[1] Query would be more accurately described as akin to an
"abstract
syntax tree" for a search string rather than a
"parse tree".
<http://en.wikipedia.org/wiki/Abstract_syntax_tree>
[2] It's possible to use Query objects to build query
specifications
that are difficult or nearly impossible to type into a
search
box.
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: opening up the scorers |

|
2008-04-18 00:15:11 |
On Thu, Apr 17, 2008 at 4:10 PM, Marvin Humphrey
<marvin rectangular.com> wrote:
> On Apr 17, 2008, at 10:43 AM, Nathan Kurz wrote:
> > I still can't keep straight how Queries and
Scorers relate.
> >
>
> Query is the abstract specification. It's little more
than a parse tree
> for a search string[1][2].
>
> Scorer does the hard work of actually scoring
documents. It's the
> practical application of a Query, where Query meets the
real world.
I appreciate the details. I'm not being willfully obtuse, I
just
haven't been thinking about this for a while.
So the tree of Queries is used to build a tree (typically)
of Scorers,
and each Query class has a one-to-one relationship with a
Scorer
class? Is there any 'query' specific code in the query
beyond the
name of the Scorer class? My desire for simplicity makes me
wonder if
one could just have a single 'QueryNode' class that
instantiates a
customizeable Scorer.
> People could potentially publish KSx subclasses that
compile down to
> scorers that behave differently from those in core.
For a custom OrScorer that I'm interested in (short-circuit
OR,
returns the score of the first match of the ordered
children) what
would I subclass and how would I call it? My instinct is it
would be
simplest just to build the Scorer tree myself and stick with
my
FirstMatchScorer in at the appropriate places. But what
would the
right way be?
> ANDQuery - Search for 'a AND b'.
> ORQuery - Search for 'a OR b'.
> ANDNOTQuery - Search for 'a AND NOT b'.
Why not just have a NotQuery? It seems like it would be
more general,
and one could always build the 'a AND NOT b' using an AND
and a NOT.
> ANDORQuery is the odd one out, because it doesn't
really mean 'a AND/OR b'.
> What it does is combine one optional clause and one
required clause.
Ditto. Why not just layer an AND and an OR? Or an AND with
a
hypothetical 'OptionalTermScorer' that returns some non-zero
score if
the term is not found?
> I chose those names because they seemed clearer than
the Lucene
> equivalents. Here's the mapping of Scorer subclasses:
Yes, these are improvements. I do like the that Lucene
names mention
that they are 'Sum' scorers, though, as it seems useful to
distinguish
how the actual scoring is done.
Nathan Kurz
nate verse.com
ps. The ice cream goes pretty well: http://screamsorbet.com/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: opening up the scorers |
  United States |
2008-04-18 20:50:45 |
On Apr 17, 2008, at 10:15 PM, Nathan Kurz wrote:
> So the tree of Queries is used to build a tree
(typically) of Scorers,
> and each Query class has a one-to-one relationship with
a Scorer
> class?
It's close to a one-to-one relationship, but it's not,
quite. Some
optimizations are possible when compiling the Scorers.
For instance, if someone has created a PhraseQuery that only
has one
term in it, you know you can compile that down to a
TermScorer instead
of PhraseScorer. Or, even better, say you have a simple
TermQuery,
and you find out that the term isn't in the index (because
$searchable-
>doc_freq returns 0). Then you can just return undef
(indicating a
null result set) instead of a Scorer.
> Is there any 'query' specific code in the query beyond
the
> name of the Scorer class?
There is actually quite a lot that happens in between a
Query and a
Scorer. That's where the "Weight" classes come in
- they encapsulate
the process of compiling a Query to a Scorer.
Query classes are indeed, very simple. There's not much to
them
excerpt for a make_weight() factory method (and an
extract_terms()
method I'd really like to kill off).
> My desire for simplicity makes me wonder if
> one could just have a single 'QueryNode' class that
instantiates a
> customizeable Scorer.
I don't quite follow.
>> People could potentially publish KSx subclasses
that compile down to
>> scorers that behave differently from those in
core.
>
> For a custom OrScorer that I'm interested in
(short-circuit OR,
> returns the score of the first match of the ordered
children) what
> would I subclass and how would I call it?
Your ORQuery subclass would probably look like this:
package FirstMatchORQuery;
use base qw( KinoSearch::Search::ORQuery );
sub make_weight {
my $self = shift;
return FirstMatchORWeight->new( _, parent
=> $self );
}
package FirstMatchORWeight;
...
> My instinct is it would be
> simplest just to build the Scorer tree myself and stick
with my
> FirstMatchScorer in at the appropriate places. But
what would the
> right way be?
You mean how would you persuade QueryParser to use your
ORQuery
variant rather than the default? Probably we'd need to give
QueryParser some sort of make_orquery() factory method you
could
override.
I'm not sure I want that to happen right away in core,
though.
QueryParser-type classes are sadly prone to death by
Featuritis. This
is the kind of thing I'd rather see refined via KSx.
>> ANDQuery - Search for 'a AND b'.
>> ORQuery - Search for 'a OR b'.
>> ANDNOTQuery - Search for 'a AND NOT b'.
>
> Why not just have a NotQuery?
Good question, and I think, good suggestion.
When we swap out ANDNOTQuery for NOTQuery, all of a sudden
we get a
coherent suite:
ANDQuery
ORQuery
NOTQuery
ReqOptQuery
Background:
NOTQuery hasn't been needed up till now. QueryParser
doesn't parse
'NOT brobniquitz' down to a NOTQuery because it's standard
behavior
for search engines to parse that kind of thing as a void
query with no
result set rather than return the universe.
> It seems like it would be more general,
> and one could always build the 'a AND NOT b' using an
AND and a NOT.
I think this is probably a good plan. I played back a
couple
scenarios in my mind to see whether the combination of an
ANDScorer
and a NOTScorer would needlessly iterate over more results
than an
ANDNOTScorer would, but with Scorer_Skip_To, I couldn't come
up with a
case where that would happen.
There's going to be a marginal increase in CPU overhead from
wrapping
a positive scorer with a NOTScorer, but I doubt it will
matter.
>> ANDORQuery is the odd one out, because it doesn't
really mean 'a
>> AND/OR b'.
>> What it does is combine one optional clause and one
required clause.
>
> Ditto. Why not just layer an AND and an OR?
I don't think that's quite the same thing??
> Or an AND with a
> hypothetical 'OptionalTermScorer' that returns some
non-zero score if
> the term is not found?
If I follow what you're saying, I think that would sort of
work, but
it's no clearer conceptually than a ReqOptQuery combining
one required
clause and one optional clause.
> I do like the that Lucene names mention
> that they are 'Sum' scorers, though, as it seems useful
to distinguish
> how the actual scoring is done.
Right. FYI, there's also a DisjunctionMaxScorer, which is
mated with
a DisjunctionMaxQuery.
> ps. The ice cream goes pretty well: http://screamsorbet.com/
Beet Lemon Sorbet! Awesome.
Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: opening up the scorers |

|
2008-04-19 14:10:45 |
On Fri, Apr 18, 2008 at 7:50 PM, Marvin Humphrey
<marvin rectangular.com> wrote:
> > My desire for simplicity makes me wonder if
> > one could just have a single 'QueryNode' class
that instantiates a
> > customizeable Scorer.
> >
>
> I don't quite follow.
Instead of building a tree of different classes of Query, it
seems
simpler to me to build a tree out of nodes of the same type
and move
the class to a field:
QueryNode:
scorer: "KSx::MyScorer"
children: [Child1, Child2]
Probably just my quirk, though. I've never been liked
subclassing for
the sake of avoiding function pointers (or their Perlish
equivalents).
I'd also want a better name than QueryNode. ;) So long as
this tree
can be easily parsed and 'optimized', I guess I don't have a
problem
with the current approach, though.
> You mean how would you persuade QueryParser to use
your ORQuery variant
> rather than the default?
Yes, I'm wondering how to get a variant to actually be used.
As it
is, the the official way seems to be to rewrite QueryParser
to use my
own classes, but this seems onerous. Or one could
post-process the
Query tree and swap in the custom class. Alternatively,
one could
take the approach I did before I bogged down, and conclude
that it's
simpler to skip the indirection and build the Scorer tree
directly.
> Probably we'd need to give QueryParser some sort
> of make_orquery() factory method you could override.
> I'm not sure I want that to happen right away in core,
though.
> QueryParser-type classes are sadly prone to death by
Featuritis. This is
> the kind of thing I'd rather see refined via KSx.
Definitely the custom scorer should go in KSx (or in some
other
userspace) but there needs to be some way to use this class
without
writing a lot of other infrastructure. Either QueryParser
needs to be
more easily subclassable, or needs to have customizable
types (skip
factories, all we need is a class name string), or there
needs to be
hook to post-process the Query tree (s/// for trees a la
XSLT).
> QueryParser doesn't parse 'NOT
> brobniquitz' down to a NOTQuery because it's standard
behavior for search
> engines to parse that kind of thing as a void query
with no result set
> rather than return the universe.
I strongly think you want to 'return the universe' here. If
you
design the system so it doesn't choke on large result sets,
it will be
truly industrial strength and multi-purpose. Instead of
thinking
about this as a search engine (with standard search engine
constraints) think of KinoSearch as a general purpose
database with
some really cool retrieval functions. Make it strong and
fearless!
> > > ANDORQuery is the odd one out, because it
doesn't really mean 'a AND/OR
> b'.
>
> > Ditto. Why not just layer an AND and an OR?
> >
>
> I don't think that's quite the same thing??
I was shooting from the hip, but I think 'A AND (A OR B)'
would
produce the same results once normalized. Given the way
caching
works, this probably isn't actually that expensive, but I
can
certainly see why it isn't perfect. Alternatively, one
could allow
OrScorer a non-zero no-match score, or come up with an
'OptionalTermScorer'.
But you are probably right: while I like the building
block
simplicity of these approaches, it's not that bad to have a
custom
Scorer for this situation. Although "Term AND
OptionalTerm" is pretty
clear too. If you do go with a RequiredAndOptionalScorer,
though,
I'd request that it be able to handle arbitrary subqueries
under the
Required half, rather than just straight Terms.
> Or, even better, say you have a simple TermQuery, and
you
> find out that the term isn't in the index (because
$searchable->doc_freq
> returns 0). Then you can just return undef (indicating
a null result set)
> instead of a Scorer.
This doesn't really strike me as an 'even better', but a
recipe for
poorly handled rare error conditions. A query with no
results is
going to be very fast to run, so this optimization isn't
really saving
much. And it definitely made my code paths uglier trying to
handle
this case. I'd much prefer to get an actual empty set of
results
after running the search (which I need to handle anyway)
rather than
have his special case.
> There is actually quite a lot that happens in between
a Query and a Scorer.
> That's where the "Weight" classes come in -
they encapsulate the process of
> compiling a Query to a Scorer.
Any chance you could write up what actually happens here?
And then
perhaps feeling too embarrassed to publish the as-builts,
rework this
part of the architecture to make it simple, streamlined, and
3x5
cardable? ;)
Nathan Kurz
nate verse.com
> > ps. The ice cream goes pretty well: http://screamsorbet.com/
> >
>
> Beet Lemon Sorbet! Awesome.
Yeah, it's surprisingly good. I'll send you up some once we
figure
out the right way to package it for shipping.
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: opening up the scorers |
  United States |
2008-04-21 22:05:06 |
On Apr 19, 2008, at 12:10 PM, Nathan Kurz wrote:
>> There is actually quite a lot that happens in
between a Query and a
>> Scorer.
>> That's where the "Weight" classes come in
- they encapsulate the
>> process of
>> compiling a Query to a Scorer.
>
> Any chance you could write up what actually happens
here?
I've now rewritten the documentation for Query and Weight,
added
official public APIs for Scorer and Tally, and tweaked
Cookbook::WildCardQuery. The new docs reflect something of
a shift in
perspective, thanks to our ongoing conversations -- instead
of
thinking of Weight's primary purpose as enabling reuse of
Query
objects (the original Lucene rationale), we think of Weight
as
something which compiles a Query to a Scorer.
I also did some of the work refactoring Weight that we
discussed a
while back.
Please have a look at the new POD. It gets dynamically
generated, so
you need to run Build.PL:
perl Build.PL
./Build code;
perldoc lib/KinoSearch/Search/Weight.pod
> And then
> perhaps feeling too embarrassed to publish the
as-builts, rework this
> part of the architecture to make it simple,
streamlined, and 3x5
> cardable? ;)
Things should be improved. I doubt you'll be satisfied,
though,
because the TF/IDF stuff is still in there. :
I thought about hiding away Sum_Of_Squared_Weights(),
Apply_Norm_Factor() and Normalize() away in a TFIDFWeight
subclass.
Unfortunately, if I do that, then subclasses of Weight which
aren't
also subclasses of TFIDFWeight wouldn't be able to
participate in
recursive normalizing. For instance, say you had an
ANDWeight whose
children were a TermWeight and a WildCardWeight -- you could
only
Normalize() the TermWeight, since the WildCardWeight
wouldn't have
that method. I don't think we'd be better off.
Instead, I changed those from abstract methods to real
methods with
sensible defaults, and made it clear early on in the
documentation
that they could be ignored under many circumstances.
Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: opening up the scorers |
  United States |
2008-04-22 00:54:42 |
On Apr 19, 2008, at 12:10 PM, Nathan Kurz wrote:
>> You mean how would you persuade QueryParser to use
your ORQuery
>> variant
>> rather than the default?
>
> Yes, I'm wondering how to get a variant to actually be
used. As it
> is, the the official way seems to be to rewrite
QueryParser to use my
> own classes, but this seems onerous.
The problem is that everyone and their dog has extremely
strong,
mutually incompatible opinions about how query parsers
should behave.
On the Lucene user list, eruptions of incredulity and
outrage over the
QueryParser design are a regular side show -- and the Lucene
QueryParser is *jammed* with features.
KinoSearch's approach is to provide only one simple
implementation
that serves the most common case. Opening up the core
QueryParser API
is a low priority (approaching zero) because I know the
feature set
will never please everyone, more features just means more
hooks to
hang hate on, and I don't need the abuse.
Instead of opening up the core class, I'd be more inclined
to write
and release KSx::Search::OpenQueryParser, which would look a
lot like
the current QueryParser but single-field and with factory
methods. In
theory, it *should* also be possible to adapt a general
purpose module
like Search::QueryParser from CPAN to build KinoSearch Query
trees --
the language of common search queries is quite small.
Things get a
little complicated when you get into Analyzers and multiple
fields,
though.
>> QueryParser doesn't parse 'NOT
>> brobniquitz' down to a NOTQuery because it's
standard behavior for
>> search
>> engines to parse that kind of thing as a void query
with no result
>> set
>> rather than return the universe.
>
> I strongly think you want to 'return the universe'
here.
Returning the universe is a perfectly reasonable behavior
for some
applications. However, I strongly disagree that it should
be the
default behavior for the core QueryParser.
If I write NOTQuery, at least then it becomes possible to
implement
your desired behavior. It's probably best if I focus my
energies on
that task.
> Instead of thinking
> about this as a search engine (with standard search
engine
> constraints) think of KinoSearch as a general purpose
database with
> some really cool retrieval functions.
KinoSearch is built around the data structure of an inverted
index,
which is not suitable for use as a general purpose database.
If we
try to pretend that KS is a database, all kinds of
annoyances start to
plague us: no unique keys, comparatively awkward updates, no
standard
way of handling one-to-many relationships, etc.
Database-like
thinking leads us to impose all kinds of constraints which
fit poorly
and would require vastly more compromises than the
search-engine model.
I recognize that people want to be able to fuse excellent
full-text
search with database-like behavior. In my judgment, the
best way to
accommodate this wish is to finish the C API and make it
possible to
hook KS into an existing database like
MyFavoriteFlavorOfSQL.
Furthermore, there's a lot more unexplored territory in the
field of
inverted indexing than in general purpose database design.
The state
of the art in search sucks, right up to the highest levels
-- Google
sucks too, they just suck less. Personally speaking, the
prospect of
writing YetAnotherDatabase holds no thrill -- but I'm both
excited and
optimistic about the possibilities for shoehorning many
different
kinds of data into segmented inverted indexes and combining
different
retrieval models. That's where I'd like to spend my time.
> If you do go with a RequiredAndOptionalScorer, though,
> I'd request that it be able to handle arbitrary
subqueries under the
> Required half, rather than just straight Terms.
Absolutely.
Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: opening up the scorers |

|
2008-04-22 16:39:45 |
On Mon, Apr 21, 2008 at 10:54 PM, Marvin Humphrey
<marvin rectangular.com> wrote:
> Instead of opening up the core class, I'd be more
inclined to write and
> release KSx::Search::OpenQueryParser, which would look
a lot like the
> current QueryParser but single-field and with factory
methods.
I'm pretty happy with that approach, although I think there
might be a
little more safe maneuvering room before the slope gets
slippery.
Opening up the API to allow syntax changes seems excessive;
customizing the output classes strikes me as reasonable. I
ignored
QueryParser and wrote my own, but my fear is that the burden
of
writing a parser is going to stop anyone from casually
experimenting
with different scorers.
A stray thought: QueryParser implies that it is parsing a
Query,
whereas it's probably clearer to think of it as building a
query from
some text, with the output tree being the actual Query. I
don't
suppose that QueryBuilder strikes you as a clearer name? It
would
make it clearer what it does...
> KinoSearch's approach is to provide only one simple
implementation that
> serves the most common case.
It is the most common case, but possibly only because doing
anything
else requires a lot of heavy lifting.
> > I strongly think you want to 'return the universe'
[for a bare NOT query].
> >
>
> Returning the universe is a perfectly reasonable
behavior for some
> applications. However, I strongly disagree that it
should be the default
> behavior for the core QueryParser.
>
> If I write NOTQuery, at least then it becomes possible
to implement your
> desired behavior. It's probably best if I focus my
energies on that task.
Probably an agree to disagree sort of situation. If I
picture the
QueryParser as being something that is task specific, this
is probably
a fine solution. My main preference would be to have the
Scorer
capable of ordering and returning large numbers of results
without
blowing up --- whether it does so by default is merely a
detail. So
yes, implementing a NOTQuery that the default parser
optimizes out
would be just fine for my purposes, although I might try to
argue that
this optimization should take place at some later stage to
allow for a
simpler Parser.
> > Instead of thinking
> > about this as a search engine (with standard
search engine
> > constraints) think of KinoSearch as a general
purpose database with
> > some really cool retrieval functions.
My phrasing was poor. I didn't mean at all that KinoSearch
should try
to be a _relational_ database. I probably should have said
'datastore'. And by 'general' I meant that it should be
agnostic
about the type of data it is storing, not that it should be
general
enough to use for all types of problems. I agree with you
completelythat this is a more interesting problem than
creating a
slightly better SomeSQL.
> I'm both excited and optimistic
> about the possibilities for shoehorning many different
kinds of data into
> segmented inverted indexes and combining different
retrieval models. That's
> where I'd like to spend my time.
Yes, and yes. And to help you along this path, I think it
would be
good to start spend some time on some use cases that are a
little
further away from TF/IDF scoring for full-text search. The
'more-like-this' approach that follows from Jack's project
seems like
it might be a good start. I'd love it if you could think
some about
kNN searches of ratings data. Relational databases don't
stand much
of a chance for either of these, whereas KinoSearch seems
like it
could handle either quite well.
Nathan Kurz
nate verse.com
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| OpenQueryParser (was "opening up
the scorers") |
  United States |
2008-04-23 23:21:03 |
On Apr 22, 2008, at 2:39 PM, Nathan Kurz wrote:
> On Mon, Apr 21, 2008 at 10:54 PM, Marvin Humphrey
> <marvin rectangular.com> wrote:
>> Instead of opening up the core class, I'd be more
inclined to write
>> and
>> release KSx::Search::OpenQueryParser, which would
look a lot like the
>> current QueryParser but single-field and with
factory methods.
>
> I'm pretty happy with that approach, although I think
there might be a
> little more safe maneuvering room before the slope gets
slippery.
> Opening up the API to allow syntax changes seems
excessive;
I think another nice feature for OpenQueryParser would be to
base it
on Parse::YAPP or something like that, and have the grammar
be
configurable via a constructor param. The core QueryParser
is based
off of regexes, which is fast and dependency-free but not
extensible.
A grammar-based QueryParser would offer more opportunities
for
customization.
The problem faced by any of these single-field parsers,
though, is
that things get messy when you try to combine queries that
involve
multiple fields, which is a very common practical need. Say
you're
searching for "foo AND NOT bar", you parse that
twice for the "title"
and "content" fields, then join the two parsed
queries with an
ORQuery. You end up with something like this:
title:(foo AND NOT bar) OR content:(foo AND NOT bar)
Unfortunately, ORing the two result sets together means that
any
document where the title matches 'foo AND NOT bar' will
match
regardless of whether the content field contains 'bar' --
and that's
probably not what the end user wants.
This was a bug in KS that got fixed a while back, and it
took making
QueryParser multi-field to fix it. What QueryParser does
now is
essentially this:
(title:foo OR content:foo) AND NOT (title:bar OR
content:bar)
I don't see a way to fix that problem except at a low-level
via a
multi-field parser. Do you?
> I ignored
> QueryParser and wrote my own, but my fear is that the
burden of
> writing a parser is going to stop anyone from casually
experimenting
> with different scorers.
I can see that. There are some custom Query subclass
concepts that
don't require mods to QueryParser to be practical, but there
are lots
that do.
> A stray thought: QueryParser implies that it is
parsing a Query,
> whereas it's probably clearer to think of it as
building a query from
> some text, with the output tree being the actual Query.
I don't
> suppose that QueryBuilder strikes you as a clearer
name? It would
> make it clearer what it does...
It's arguable. QueryParser does parse a query string, after
all.
>>> I strongly think you want to 'return the
universe' [for a bare NOT
>>> query].
>>>
>>
>> Returning the universe is a perfectly reasonable
behavior for some
>> applications. However, I strongly disagree that it
should be the
>> default
>> behavior for the core QueryParser.
>>
>> If I write NOTQuery, at least then it becomes
possible to implement
>> your
>> desired behavior. It's probably best if I focus my
energies on
>> that task.
>
> Probably an agree to disagree sort of situation.
The goal is to behave as an end user typing into a search
box on a
website would expect. The big web search engine sites set
the trends,
and KinoSearch's core QueryParser follows.
However, I think it would be reasonable for OpenQueryParser
to have
return-the-universe behavior by default. How easy would it
be to
switch it off?
> My main preference would be to have the Scorer
> capable of ordering and returning large numbers of
results without
> blowing up --- whether it does so by default is merely
a detail.
KS won't blow up, because the standard TopDocs search uses a
finite-
sized HitQueue to order results on the fly as scoring
proceeds rather
than accumulating a giant array of hits and sorting by score
at the end.
> So yes, implementing a NOTQuery that the default parser
optimizes out
> would be just fine for my purposes, although I might
try to argue that
> this optimization should take place at some later stage
to allow for a
> simpler Parser.
I don't think we can move it outside of QueryParser. We'd
have to
adapt every search method individually, which wouldn't be
practical or
wise.
Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: OpenQueryParser (was "opening
up the scorers") |

|
2008-04-27 17:07:25 |
On Wed, Apr 23, 2008 at 10:21 PM, Marvin Humphrey
<marvin rectangular.com> wrote:
> The problem faced by any of these single-field
parsers, though, is that
> things get messy when you try to combine queries that
involve multiple
> fields, which is a very common practical need.
> ...
> I don't see a way to fix that problem except at a
low-level via a
> multi-field parser. Do you?
You could have the Parser build a tree with a special field
type of
'any', which then gets expanded out to multiple fields at a
later
stage. I'd sort of like to have this stage anyway, since it
keep the
Parser more independent of the Index, and would let me do
tricks like
replacing OrScorer with MyOrScorer. Instead of trying to
build an
optimizing Parser, you could do the optimizations and checks
in a
separate pass and keep the Parser simpler.
> > A stray thought: QueryParser implies that it is
parsing a Query,
> > whereas it's probably clearer to think of it as
building a query from
> > some text, with the output tree being the actual
Query. I don't
> > suppose that QueryBuilder strikes you as a clearer
name? It would
> > make it clearer what it does...
> >
>
> It's arguable. QueryParser does parse a query string,
after all.
I think that's part of the problem. In my mind, a Query is
just
string, not a Tree. Having a QueryParser that parses a
Query (string)
and returns a ParseTree would be great. Having it parse a
query and
return a Query is confusing.
> The goal is to behave as an end user typing into a
search box on a website
> would expect. The big web search engine sites set the
trends, and
> KinoSearch's core QueryParser follows.
Do users really expect this behaviour, or is a shortcut
taken by
programmers? Realizing that probably only a tiny number of
end users
ever use stop words at all, if _I_ were to type '-foo' into
a site
search box, I would expect it to return all documents that
do not
contain the word 'foo', probably ordered by popularity.
This would
certainly be more useful than claiming that no documents
match. That
said, despite urging you to make KinoSearch more general, I
agree that
out of the box it should work the way that users expect as a
site
search engine, and that any other uses should be secondary.
> > My main preference would be to have the Scorer
> > capable of ordering and returning large numbers of
results without
> > blowing up --- whether it does so by default is
merely a detail.
> >
>
> KS won't blow up, because the standard TopDocs search
uses a finite-sized
> HitQueue to order results on the fly as scoring
proceeds rather than
> accumulating a giant array of hits and sorting by score
at the end.
'Blow up' was sloppy speech on my part. 'Grind to a halt'
would
probably be closer. I'd like to have a Scorer that is
either smart
enough to avoid processing the entire index for queries that
match
(almost) every document, or fast enough that processing the
entire
index is no big deal.
I haven't thought about it for a while, but at one point I
had a
scheme to do this with a minimum document number and a
maximum
document score. If the whole HitQueue was at the max score,
you could
return early. If a max score occurs at less than the
minimum document
number, you skip it as already returned. This would let
you
semi-efficiently do things like return hits 1,000,000 to
1,000,100,
although sometimes you'd need a second pass to pick up
stragglers.
Nathan Kurz
nate verse.com
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
|
|