On Apr 15, 2008, at 9:33 PM, jack_tanner yahoo.com wrote:
> This is a kind of duplicate detection task. I have a corpus of
> documents written by a known, small set of authors. I want to rank
> the authors w.r.t. how much they repeat themselves. To do that, I
> want to take all docs written by the same author, compute their
> pairwise similarities, and then average those similarities.
> (Probably just take the mean.) I'm going to repeat this for all
> authors. At the end, I have a "repetitiveness" score for each
> author. This score is the actual end goal.

Neat. Not that this is what you're doing, but I can imagine something
like this being used as a supervisory tool for people who get paid for
generating content when the primary criterion is volume rather than
quality. Copy-and-paste documents with minor variations would appear
tightly grouped in vector space.
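
The per-author score described in the quoted message could be sketched roughly as follows. This is an illustration in Python, not KinoSearch code: `repetitiveness`, `rank_authors`, and the toy `jaccard` similarity are hypothetical names, and in practice the similarity function would be whatever score the search engine hands back.

```python
from itertools import combinations
from statistics import mean


def repetitiveness(docs, similarity):
    """Mean pairwise similarity over all unordered pairs of one author's docs."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0  # assumption: fewer than two docs means nothing to compare
    return mean(similarity(a, b) for a, b in pairs)


def jaccard(a, b):
    """Toy stand-in similarity: word-set overlap. Not a search-engine score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def rank_authors(docs_by_author, similarity=jaccard):
    """Return (author, score) pairs, most repetitive first."""
    scores = {
        author: repetitiveness(docs, similarity)
        for author, docs in docs_by_author.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Note that the number of pairs grows quadratically with the number of docs per author, which may matter for prolific authors.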

>> The brute force way is to take the contents of a document or possibly
>> a distillation of the contents and use that as your query, hand off to
>> a Searcher and see what the search gives back. That gives you a bunch
>> of docs, though -- not just one. You can constrain the search by
>> adding a "primary key"-type requirement, though performance of such a
>> search might be a concern with large indexes due to the way KS
>> compiles its queries.
>
> I can definitely do that, and then just loop over the hits until I
> get the doc of interest. The only problem is if the doc of interest
> is not retrieved at all... but then I can assign that a score of 0.
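
The loop-over-the-hits idea above might look something like this. It is a sketch in Python rather than KinoSearch code: `search` is a hypothetical callable standing in for the real Searcher, assumed to take a doc id and return `(doc_id, score)` hits.

```python
def score_against(query_doc_id, target_doc_id, search):
    """Search with one doc as the query; return the score of another doc.

    `search` is a hypothetical callable: it takes a doc id and returns an
    iterable of (doc_id, score) hits, best first.
    """
    for hit_id, score in search(query_doc_id):
        if hit_id == target_doc_id:
            return score
    return 0.0  # target not retrieved at all: treat as similarity 0
```

For example, with a fake searcher backed by a dict, `score_against("d1", "d2", search)` returns whatever score `d2` received in the hit list for `d1`, and 0.0 if `d2` never appears.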

Please let us know how it goes.

I suggest using only one field; otherwise you might get some
distortions and exaggerations in the scoring curves as artifacts of
the query parsing wizard.

You may also run afoul of the max_clause_count of 1024 in BooleanQuery
because the queries will have so many components. To defeat this in
0.1x, add this to your code:

    # hack to override safety feature
    local $KinoSearch::Search::BooleanQuery::instance_vars{max_clause_count}
        = $a_really_big_number;
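
An alternative to raising the limit would be the "distillation" idea from earlier in the thread: shrink each document to a bounded set of terms before turning it into a query. The sketch below is Python, not KinoSearch code; `distill` is a hypothetical helper, and picking terms by raw frequency is only an assumption (rarer terms may well discriminate better).

```python
from collections import Counter
import re

MAX_CLAUSES = 1024  # the max_clause_count limit quoted above


def distill(text, max_terms=MAX_CLAUSES):
    """Keep at most `max_terms` distinct terms, most frequent first."""
    terms = re.findall(r"\w+", text.lower())
    counts = Counter(terms)
    return [term for term, _ in counts.most_common(max_terms)]
```

The resulting term list would then be joined into the query string, which keeps the number of clauses at or below the limit without touching the safety feature.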

KinoSearch's scoring model uses Lucene's slight variant on vanilla
TF/IDF. Length normalization is in there; the resolution is low, but
that shouldn't matter. The one thing that's a little unusual is the
addition of a "coord" function which boosts OR'd queries when multiple
clauses match. It will affect your scores, but probably not too much
since the formula is proportional: num_matchers / max_matchers.
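
To make the coord formula concrete, here is a small sketch in Python (not KinoSearch code). The function names are hypothetical, and multiplying the summed clause score by the factor is an assumption modeled on Lucene's classic scoring rather than something spelled out in this message.

```python
def coord(num_matchers, max_matchers):
    """Fraction of the query's clauses that matched this document."""
    if max_matchers == 0:
        return 0.0
    return num_matchers / max_matchers


def apply_coord(raw_score, num_matchers, max_matchers):
    """Scale a summed clause score by the coord factor (assumed multiplicative)."""
    return raw_score * coord(num_matchers, max_matchers)
```

So a document matching 3 of 4 OR'd clauses keeps three quarters of its raw score, while one matching every clause is left unchanged.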

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch