Thread: Re: get doc/query similarity




Re: get doc/query similarity
United States
2008-04-15 23:33:28
> From: Marvin Humphrey <marvin@rectangular.com>
>
> I've started a reply to this several times, then balled it up and
> ashcanned it.  I understand what you want theoretically, and the
> document frequency and term frequency information is in the index and
> accessible at least via private APIs.  The question is how to achieve
> whatever your end goal is efficiently and conveniently.

So, you're asking me why exactly I want to go shoot myself
in the foot. 

The setting is NOT a general IR application. I'm working
with a very small corpus, and expensive operations are just
fine with me. 

This is a kind of duplicate detection task. I have a
corpus of documents written by a known, small set of
authors. I want to rank the authors w.r.t. how much they
repeat themselves. To do that, I want to take all docs
written by the same author, compute their pairwise
similarities, and then average those similarities. (Probably
just take the mean.) I'm going to repeat this for all
authors. At the end, I have a "repetitiveness"
score for each author. This score is the actual end goal.
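
To make the scoring concrete, here is a rough, library-independent sketch in Python. It is not KinoSearch code: the cosine similarity over raw term counts is only a stand-in for whatever similarity the search engine reports, and the names `cosine`, `repetitiveness`, and `docs_by_author` are made up for illustration. An author with fewer than two docs gets a score of 0 here, which is an assumption.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def repetitiveness(docs_by_author):
    """Map each author to the mean pairwise similarity of that author's docs.

    docs_by_author: dict of author name -> list of document strings
    (a hypothetical input shape, chosen for the example).
    """
    scores = {}
    for author, docs in docs_by_author.items():
        # Naive whitespace tokenization; a real run would use the index's analyzer.
        vectors = [Counter(doc.lower().split()) for doc in docs]
        pairs = list(combinations(vectors, 2))
        if pairs:
            scores[author] = sum(cosine(a, b) for a, b in pairs) / len(pairs)
        else:
            scores[author] = 0.0  # assumption: no pairs means no measurable repetition
    return scores
```

Swapping `cosine` for scores pulled from a Searcher would leave the averaging loop unchanged.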

> The brute force way is to take the contents of a document or possibly
> a distillation of the contents and use that as your query, hand off to
> a Searcher and see what the search gives back.  That gives you a bunch
> of docs, though -- not just one.  You can constrain the search by
> adding a "primary key"-type requirement, though performance of such a
> search might be a concern with large indexes due to the way KS
> compiles its queries.

I can definitely do that, and then just loop over the hits
until I get the doc of interest. The only problem is if the
doc of interest is not retrieved at all... but then I can
assign that a score of 0.
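
In sketch form (Python, not the KinoSearch API), the lookup with the zero fallback might look like this. The `hits` list of `(doc_id, score)` pairs and the function name `score_for_doc` are hypothetical; a real run would build that list from whatever the Searcher returns.

```python
def score_for_doc(hits, target_id):
    """Return the score of target_id among the hits, or 0.0 if it was not retrieved.

    hits: iterable of (doc_id, score) pairs, an assumed shape for this example.
    """
    for doc_id, score in hits:
        if doc_id == target_id:
            return score
    # The doc of interest did not come back at all: treat it as no similarity.
    return 0.0
```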

_______________________________________________
KinoSearch mailing list
KinoSearchrectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch


Re: get doc/query similarity
United States
2008-04-16 01:59:45
On Apr 15, 2008, at 9:33 PM, jack_tanner@yahoo.com wrote:
> This is a kind of duplicate detection task. I have a corpus of
> documents written by a known, small set of authors. I want to rank
> the authors w.r.t. how much they repeat themselves. To do that, I
> want to take all docs written by the same author, compute their
> pairwise similarities, and then average those similarities.
> (Probably just take the mean.) I'm going to repeat this for all
> authors. At the end, I have a "repetitiveness" score for each
> author. This score is the actual end goal.

Neat.  Not that this is what you're doing, but I can imagine something
like this being used as a supervisory tool for people who get paid for
generating content when the primary criterion is volume rather than
quality.  Copy-and-paste documents with minor variations would appear
tightly grouped in vector space.

>> The brute force way is to take the contents of a document or possibly
>> a distillation of the contents and use that as your query, hand off
>> to a Searcher and see what the search gives back.  That gives you a
>> bunch of docs, though -- not just one.  You can constrain the search
>> by adding a "primary key"-type requirement, though performance of
>> such a search might be a concern with large indexes due to the way
>> KS compiles its queries.
>
> I can definitely do that, and then just loop over the hits until I
> get the doc of interest. The only problem is if the doc of interest
> is not retrieved at all... but then I can assign that a score of 0.


Please let us know how it goes.

I suggest using only one field, otherwise you might get some
distortions and exaggerations in the scoring curves as artifacts of
the query parsing wizard.

You may also run afoul of the max_clause_count of 1024 in BooleanQuery
because the queries will have so many components.  To defeat this in
0.1x, add this to your code:

    # hack to override safety feature
    local $KinoSearch::Search::BooleanQuery::instance_vars{max_clause_count}
        = $a_really_big_number;

KinoSearch's scoring model uses Lucene's slight variant on vanilla
TF/IDF.  Length normalization is in there; the resolution is low, but
that shouldn't matter.  The one thing that's a little unusual is the
addition of a "coord" function which boosts OR'd queries when multiple
clauses match.  It will affect your scores, but probably not too much
since the formula is proportional: num_matchers / max_matchers.
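
To illustrate the coord factor, here is a small Python sketch rather than KinoSearch internals. It assumes coord is applied as a multiplier on the summed scores of the matching clauses, as in Lucene's classic scoring; the function names and the use of `None` for a non-matching clause are made up for the example.

```python
def coord(num_matchers, max_matchers):
    """Fraction of the query's OR'd clauses that matched the document."""
    if max_matchers <= 0:
        return 0.0
    return num_matchers / max_matchers


def coord_scaled_score(clause_scores):
    """Sum the scores of matching clauses, then scale by the coord factor.

    clause_scores: one entry per query clause, None where the clause did
    not match (an assumed representation for this example).
    """
    matched = [s for s in clause_scores if s is not None]
    return sum(matched) * coord(len(matched), len(clause_scores))
```

A document matching 2 of 4 clauses keeps half of its summed clause score, while one matching all 4 keeps all of it, so documents sharing more of the query's terms are pulled ahead.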

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
