List Info

Thread: Re: get doc/query similarity




Re: get doc/query similarity
country flaguser name
United States
2008-04-27 15:49:15
----- Original Message ----
> From: Marvin Humphrey <marvinrectangular.com>

[snip discussion of retrieval of a doc via IndexReader or a
Searcher]

OK, I grant that KS doesn't want to be an RDB. What seems to
me to be a good idea is a method of retrieving *one
specific* doc from the index, given some key, that is
relatively cheaper than doing a search.

> You can fake a retrieval by primary key something like
this:
> 
>     package MySearcher;
>     use base qw( KinoSearch::Searcher );
> 
>     sub im_feeling_lucky {
>        my ( $self, $key ) = _;
>        my $termquery =
KinoSearch::Search::TermQuery->new(
>           field => 'pri_key_field',
>           term  => $key,
>        );
>        my $hits = $searcher->search( query =>
$term_query, num_wanted  
> => 1 );
>        return $hits->fetch_hit;
>     }
> 
> That will work if you have your own exterior mechanism
for  
> guaranteeing the uniqueness of a particular field
during indexing.

There's nothing wrong with that API. Suppose we could extend
that by telling KS that we indeed have an exterior mechanism
of guaranteeing the uniqueness of pri_key_field. It could
then stop searching the moment it gets the first hit, and
return that. If we fail in our uniqueness guarantee, well,
it still only returns one hit, and we can't know
deterministically which one.

Moreover, KS could have a special-cased optimization for
searching that field, perhaps requireing some syntax
restrictions on the value (numeric only? single token
only?).

By the way, there seems to be a related mechanism for
$invindexer->delete_docs_by_term().

> First, we need a way of obtaining document numbers from
a search.  The  
> easiest way to make this happen is to expose
get_doc_num for HitDoc.   
> (There are other places as well, that's just the
easiest and it would  
> work for our purposes.)

But you just said you don't want to expose internal KS doc
numbers?

>    * What's a better name than DocVector? 
AnalyzedDoc?

Yes.

>    * Should we store any other information besides the
terms and
>      their positions, start_offsets and end_offsets?

You probably want to make sure that term frequencies are
accessible, as well as left/right positional context. I'm
not making claims about how these should be stored, just
that they're accessible efficiently.

One similarity metric that's useful to compute is doc-doc
similarity over token or character n-grams. How would one do
that in our brave new world?

>    * How should the data file be formatted?

That's a bit beyond my reach.

> Yeah, absolutely.  It's the same way with Lucene, and
KS scoring is  
> directly derived from the Lucene scoring model.  Lucene
and KS only  
> care about coarse relative ranking, so there are some
adulterations  
> and approximations in the similarity calculations.

Something that may be useful is a toggle to normalize the
returned scores...

Can one do pseudo-relevance feedback using KS? That is, run
a search, get some hits, then use the hits as the terms for
a new search. Optionally, loop over the hits and exclude
unwanted docs before executing the new search.






     
____________________________________________________________
________________________
Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9
tAcJ


_______________________________________________
KinoSearch mailing list
KinoSearchrectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch


Re: get doc/query similarity
country flaguser name
United States
2008-04-28 16:11:44
On Apr 27, 2008, at 1:49 PM, jack_tanneryahoo.com
wrote:

> What seems to me to be a good idea is a method of
retrieving *one  
> specific* doc from the index, given some key, that is
relatively  
> cheaper than doing a search.

Here's the slightly more optimized version -- which, unlike
the  
earlier version, will work with a Searcher but not a
MultiSearcher.   
Also, it returns the first hit rather than the highest
scoring hit.

    package MySearcher;
    use base qw( KinoSearch::Searcher );

    sub single_hit {
       my ( $self, $key ) = _;
       my $reader = $self->get_reader;
       my $posting_list = $reader->posting_list(
          field => 'pri_key_field',
          term  => $key,
       );
       return unless $posting_list;
       my $doc_num = $posting_list->next;
       return unless $doc_num;
       return $reader->fetch_doc($doc_num);
    }

If the term is unique, I doubt that the there will be
significant  
performance differences between the two implementations. 
The i/o  
costs are exactly the same -- the searching technique just
involves a  
few more wrappers.

The more important difference is that you can get a
PostingList from  
an IndexReader, but not from a Searcher. PostingLists, like 

IndexReaders, are tied to individual machines.  Searcher is
a single- 
machine subclass of Searchable; and Searchable's interface
is designed  
to work with document collections which may or may not
reside on a  
single machine.

I realize that sounds kind of complicated, but the point is
that we  
have to balance two competing design goals with Searcher. On
one hand,  
it's a single entrance point in a large, flexible retrieval
API -- and  
it has to remain a responsible role player within the larger
system.   
On the other hand, it's the most popular entrance point, so
it has to  
be easy to use without needing to grok the whole system.

> Suppose we could extend that by telling KS that we
indeed have an  
> exterior mechanism of guaranteeing the uniqueness of
pri_key_field.  
> It could then stop searching the moment it gets the
first hit, and  
> return that.

As soon as the Searcher exhausts the posting list, scoring
stops.  So  
the cost to score all hits actually depends on whether you 

successfully enforce uniqueness from outside.  

> Moreover, KS could have a special-cased optimization
for searching  
> that field, perhaps requireing some syntax restrictions
on the value  
> (numeric only? single token only?).

You'll need to set the field to be unanalyzed; otherwise you
have add  
an Analyzer into the mix.

> By the way, there seems to be a related mechanism for
$invindexer- 
> >delete_docs_by_term().

InvIndexer is always operating on a single machine.

Hmm.  Maybe InvIndexer should be a subclass of a more
general abstract  
class defining an indexing interface which isn't necessarily
tied to  
one machine -- i.e. an index-time counterpart to
Searchable.

>> First, we need a way of obtaining document numbers
from a search.   
>> The
>> easiest way to make this happen is to expose
get_doc_num for HitDoc.
>> (There are other places as well, that's just the
easiest and it would
>> work for our purposes.)
>
> But you just said you don't want to expose internal KS
doc numbers?

I've been trying to keep them hidden, and we've gotten
pretty far  
without them.

But it was inevitable that doc numbers would get exposed
eventually,  
because they are needed by many of the low-level, expert
components  
we're starting to expose: PostingList, HitCollector, etc.

The downside of exposing KinoSearch's document numbers is
that their  
behavior isn't intuitive: they're consistent for the life of
a  
Searchable or IndexReader, but outside of that, they
sometimes change.

>>   * Should we store any other information besides
the terms and
>>     their positions, start_offsets and
end_offsets?
>
> You probably want to make sure that term frequencies
are accessible,

Good thought.  They're already in there, but I had neglected
to  
mention it.

> as well as left/right positional context.

I think you're just using different names for what I'm
calling  
"start_offset" and "end_offset", which
are measured in Unicode code  
points from the top of the field.

> One similarity metric that's useful to compute is
doc-doc similarity  
> over token or character n-grams.  How would one do that
in our brave  
> new world?

I think you'd want to actually index the n-grams by
specifying some  
sort of n-gram Analyzer subclass, then use the same
techniques we've  
been discussing.

> Something that may be useful is a toggle to normalize
the returned  
> scores...


They did that with the Lucene version of Hits and it seems
to have  
resulted in a lot of confusion.  I think it's misleading,
because it  
gives the illusion of an absolute similarity measure but not
the  
reality.

You can always normalize scores yourself off of the score of
the first  
HitDoc returned by the Hits object.  (Though that technique
doesn't  
work with sorted results because the first hit probably
isn't the  
highest scoring.)

> Can one do pseudo-relevance feedback using KS? That is,
run a  
> search, get some hits, then use the hits as the terms
for a new  
> search. Optionally, loop over the hits and exclude
unwanted docs  
> before executing the new search.


You can do that, but I've never been impressed with how
"more like  
this" queries behave using vanilla TF-IDF.  Rare terms
in the feedback  
documents which are unrelated to the original query --
especially  
proper names -- tend to spawn high-scoring irrelevant
results.

To make such a feedback system work well, I think you'd need
be  
working with a decomposed term space a la LSA and apply the
term space  
of the original query as a filter on the term space of each
feedback  
doc.

Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/


_______________________________________________
KinoSearch mailing list
KinoSearchrectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch


[1-2]

about | contact  Other archives ( Real Estate discussion Medical topics )