On Apr 27, 2008, at 1:49 PM, jack_tanner yahoo.com
wrote:
> What seems to me to be a good idea is a method of
retrieving *one
> specific* doc from the index, given some key, that is
relatively
> cheaper than doing a search.
Here's the slightly more optimized version -- which, unlike
the
earlier version, will work with a Searcher but not a
MultiSearcher.
Also, it returns the first hit rather than the highest
scoring hit.
package MySearcher;
use base qw( KinoSearch::Searcher );
sub single_hit {
my ( $self, $key ) = _;
my $reader = $self->get_reader;
my $posting_list = $reader->posting_list(
field => 'pri_key_field',
term => $key,
);
return unless $posting_list;
my $doc_num = $posting_list->next;
return unless $doc_num;
return $reader->fetch_doc($doc_num);
}
If the term is unique, I doubt that the there will be
significant
performance differences between the two implementations.
The i/o
costs are exactly the same -- the searching technique just
involves a
few more wrappers.
The more important difference is that you can get a
PostingList from
an IndexReader, but not from a Searcher. PostingLists, like
IndexReaders, are tied to individual machines. Searcher is
a single-
machine subclass of Searchable; and Searchable's interface
is designed
to work with document collections which may or may not
reside on a
single machine.
I realize that sounds kind of complicated, but the point is
that we
have to balance two competing design goals with Searcher. On
one hand,
it's a single entrance point in a large, flexible retrieval
API -- and
it has to remain a responsible role player within the larger
system.
On the other hand, it's the most popular entrance point, so
it has to
be easy to use without needing to grok the whole system.
> Suppose we could extend that by telling KS that we
indeed have an
> exterior mechanism of guaranteeing the uniqueness of
pri_key_field.
> It could then stop searching the moment it gets the
first hit, and
> return that.
As soon as the Searcher exhausts the posting list, scoring
stops. So
the cost to score all hits actually depends on whether you
successfully enforce uniqueness from outside.
> Moreover, KS could have a special-cased optimization
for searching
> that field, perhaps requireing some syntax restrictions
on the value
> (numeric only? single token only?).
You'll need to set the field to be unanalyzed; otherwise you
have add
an Analyzer into the mix.
> By the way, there seems to be a related mechanism for
$invindexer-
> >delete_docs_by_term().
InvIndexer is always operating on a single machine.
Hmm. Maybe InvIndexer should be a subclass of a more
general abstract
class defining an indexing interface which isn't necessarily
tied to
one machine -- i.e. an index-time counterpart to
Searchable.
>> First, we need a way of obtaining document numbers
from a search.
>> The
>> easiest way to make this happen is to expose
get_doc_num for HitDoc.
>> (There are other places as well, that's just the
easiest and it would
>> work for our purposes.)
>
> But you just said you don't want to expose internal KS
doc numbers?
I've been trying to keep them hidden, and we've gotten
pretty far
without them.
But it was inevitable that doc numbers would get exposed
eventually,
because they are needed by many of the low-level, expert
components
we're starting to expose: PostingList, HitCollector, etc.
The downside of exposing KinoSearch's document numbers is
that their
behavior isn't intuitive: they're consistent for the life of
a
Searchable or IndexReader, but outside of that, they
sometimes change.
>> * Should we store any other information besides
the terms and
>> their positions, start_offsets and
end_offsets?
>
> You probably want to make sure that term frequencies
are accessible,
Good thought. They're already in there, but I had neglected
to
mention it.
> as well as left/right positional context.
I think you're just using different names for what I'm
calling
"start_offset" and "end_offset", which
are measured in Unicode code
points from the top of the field.
> One similarity metric that's useful to compute is
doc-doc similarity
> over token or character n-grams. How would one do that
in our brave
> new world?
I think you'd want to actually index the n-grams by
specifying some
sort of n-gram Analyzer subclass, then use the same
techniques we've
been discussing.
> Something that may be useful is a toggle to normalize
the returned
> scores...
They did that with the Lucene version of Hits and it seems
to have
resulted in a lot of confusion. I think it's misleading,
because it
gives the illusion of an absolute similarity measure but not
the
reality.
You can always normalize scores yourself off of the score of
the first
HitDoc returned by the Hits object. (Though that technique
doesn't
work with sorted results because the first hit probably
isn't the
highest scoring.)
> Can one do pseudo-relevance feedback using KS? That is,
run a
> search, get some hits, then use the hits as the terms
for a new
> search. Optionally, loop over the hits and exclude
unwanted docs
> before executing the new search.
You can do that, but I've never been impressed with how
"more like
this" queries behave using vanilla TF-IDF. Rare terms
in the feedback
documents which are unrelated to the original query --
especially
proper names -- tend to spawn high-scoring irrelevant
results.
To make such a feedback system work well, I think you'd need
be
working with a decomposed term space a la LSA and apply the
term space
of the original query as a filter on the term space of each
feedback
doc.
Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|