Thread: Re: Wildcards
|
|
| Re: Wildcards |
  United States |
2008-02-08 05:59:14 |
Father Chrysostomos:
> I suppose the answers to these questions are precisely what you are
> working on.

Please take a look at the newly committed
KinoSearch::Docs::Cookbook::WildcardQuery and let me know how it goes:
<http://xrl.us/bfust>

(We'll have to expose Scorer and Tally as public classes, plus all the
methods overridden in the cookbook examples.)

> Anyway, the attached patch shows what I've been trying to do so far
> (completely untested).

I didn't see this because my email machine just crashed, I've had to
restore from backup, and the list archive didn't preserve it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: Re: Wildcards |

|
2008-02-08 10:45:40 |
On Feb 8, 2008, at 3:59 AM, Marvin Humphrey wrote:
> Father Chrysostomos:
>
>> I suppose the answers to these questions are precisely what you are
>> working on.
>
> Please take a look at the newly committed
> KinoSearch::Docs::Cookbook::WildcardQuery and let me know how it goes:
> <http://xrl.us/bfust>

Thank you. That’s very helpful.

> +    $tally{$id} = KinoSearch::Search::Tally->new;
> +    $tally{$id}->set_score(1.0);    # fixed score of 1.0

One question: Is this the place where the weight’s value should be
specified? (I.e., ->set_score($weight->get_value) )

>> Anyway, the attached patch shows what I've been trying to do so far
>> (completely untested).
>
> I didn't see this because my email machine just crashed, I've had to
> restore from backup, and the list archive didn't preserve it.

It’s probably not much use to you now, but here it is anyway:

Father Chrysostomos

P.S.: I think there’s something else wrong with your e-mail program.
It’s outputting a Bcc header, which clearly shouldn’t be there.
|
|
|
| Re: Re: Wildcards |

|
2008-02-08 12:10:55 |
On 2/8/08, Marvin Humphrey <marvin rectangular.com> wrote:
> Please take a look at the newly committed
> KinoSearch::Docs::Cookbook::WildcardQuery and let me know how it goes:
> <http://xrl.us/bfust>

Seems like a wonderful sort of example to have. I didn't read through
closely, but saw a small typo: the abstract refers to 'leading
wildcards' instead of 'trailing wildcards'.

--nate
|
|
| Re: Re: Wildcards |
  United States |
2008-02-11 20:01:20 |
On Feb 8, 2008, at 8:45 AM, Father Chrysostomos wrote:
>
>> +    $tally{$id} = KinoSearch::Search::Tally->new;
>> +    $tally{$id}->set_score(1.0);    # fixed score of 1.0
>
> One question: Is this the place where the weight’s value should be
> specified? (I.e., ->set_score($weight->get_value) )

It would be, except that the example code never even bothers to
calculate such a value.

I wrote up the tutorial in the spirit of designing a public API from
first principles (and in fact, made some changes to the API while
writing it up). I think what's in those docs is pretty coherent, and
manages to fulfill the high-level requirements. The Query-Weight-Scorer
hierarchy is here to stay, methinks, and that's what I wanted to cover.

The stuff that got left out, though, is a lot less coherent right
now. It's spread out over Weight, Similarity, and Posting, and I'm
not sure exactly how it should be refactored.

As a starting point... Weight should probably have a single crucial
method which takes a Searchable as its main argument:

    sub compute {
        my ( $self, $searchable ) = @_;
        # start from the parent Query's boost
        $value{$$self} = $self->get_parent->get_boost;
    }

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
|
|
| Re: Re: Wildcards |

|
2008-02-11 22:41:20 |
On Feb 11, 2008, at 6:01 PM, Marvin Humphrey wrote:
> The stuff that got left out, though, is a lot less coherent right
> now. It's spread out over Weight, Similarity, and Posting, and I'm
> not sure exactly how it should be refactored.

If you could explain to me in a few words what each method does, and
who calls it, I might be able to do some brainstorming. Right now most
of it is still voodoo to me.

Father Chrysostomos
|
|
| Re: Re: Wildcards |
  United States |
2008-02-12 18:38:06 |
On Feb 11, 2008, at 8:41 PM, Father Chrysostomos wrote:
> If you could explain to me in a few words what each method does, and
> who calls it, I might be able to do some brainstorming. Right now
> most of it is still voodoo to me.

I'm working on writing up a coherent explanation.

So far, I'd be happy to tell you about Similarity::query_norm. It
used to do something not very useful, as explained here:

http://xrl.us/bf4q3
(Link to mail-archives.apache.org)

Now it's gone. I zapped it.

More to come.

PS: Weight's docs have been improved a bit.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
|
|
| Re: Re: Wildcards |

|
2008-02-13 22:26:59 |
On 2/11/08, Marvin Humphrey <marvin rectangular.com> wrote:
> I think what's in those docs is pretty coherent, and
> manages to fulfill the high-level requirements. The Query-Weight-
> Scorer hierarchy is here to stay, methinks, and that's what I wanted
> to cover.

Hey Marvin ---

I think you are going along a good path here, and that fleshing out a
couple of these examples would be very useful, both as tutorials and
to determine if parts of the API need refinement. For a simple (and
self-serving) example that would make me happy, I'd love to see a
setup that scores Boolean queries purely by the weights given by the
user: ORs return the highest subscore, ANDs multiply.
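
A rough sketch of that scoring rule, for anyone who wants to play with
it. Plain hashes stand in for query objects here; the node layout and
the score_node function are made up for illustration and are not part
of the KinoSearch API. It also assumes that an AND matches only when
every child matches, which the message above does not spell out.

```perl
use strict;
use warnings;
use List::Util qw( max );

# Hypothetical query tree (not KinoSearch classes):
#   leaf:     { term => 'foo', weight => 2 }
#   compound: { op => 'AND' | 'OR', children => [ ... ] }
# $doc_terms is a hash whose keys are the terms present in one document.
# Returns the score, or undef if the document does not match.
sub score_node {
    my ( $node, $doc_terms ) = @_;

    # Leaf: the score is the user-supplied weight, if the term is present.
    if ( !$node->{op} ) {
        return $doc_terms->{ $node->{term} } ? $node->{weight} : undef;
    }

    my @sub_scores = map { score_node( $_, $doc_terms ) } @{ $node->{children} };
    my @hits       = grep { defined } @sub_scores;

    # OR: the highest subscore among the children that matched.
    if ( $node->{op} eq 'OR' ) {
        return @hits ? max(@hits) : undef;
    }

    # AND: every child must match; the subscores are multiplied.
    return undef if @hits != @sub_scores;
    my $score = 1;
    $score *= $_ for @hits;
    return $score;
}

# foo AND (bar OR baz), with user-assigned weights 2, 3 and 5.
my $query = {
    op       => 'AND',
    children => [
        { term => 'foo', weight => 2 },
        {   op       => 'OR',
            children => [
                { term => 'bar', weight => 3 },
                { term => 'baz', weight => 5 },
            ],
        },
    ],
};

print score_node( $query, { foo => 1, bar => 1 } ), "\n";    # 6
```
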

As to the fitness of the Query-Weight-Scorer system, you're probably
right to want to keep it. But I think you can do a lot to tidy it up
and make it more comprehensible to newcomers. Instead of writing a
detailed description of each method on each object, I think you'd get
more benefit out of a solid high-level overview. I'm a firm believer
that an easy-to-explain architecture will be both easier to maintain
and easier to improve.

Father C (and lurkers), I think it would be great if you could write
up your overview as well. Even if you haven't poked around all the
innards in depth, you're much closer to the way it works than most
users will ever be. So without reference to how it actually works,
write up something describing how it should work. I think that
understanding how potential users intuitively view a problem is a great
step toward providing them a solution.

Nathan Kurz
nate verse.com

ps. Marvin, I apologize if I'm coming across as too much of an
armchair critic lately. I'll send you some ice cream one of these
days. While I haven't been writing much code, the ice cream's getting
to be really tasty.
|
|
| Re: Re: Wildcards |
  United States |
2008-02-13 23:03:03 |
On Feb 13, 2008, at 8:26 PM, Nathan Kurz wrote:
> Father C (and lurkers), I think it would be great if you could write
> up your overview as well. Even if you haven't poked around all the
> innards in depth, you're much closer to the way it works than most
> users will ever be. So without reference to how it actually works,
> write up something describing how it should work.

Quickly...

I'm pretty close to a coherent API for Weight.

I just refactored so that boosts are dealt with only at construction
time. They now propagate from Query to Weight in a very
straightforward way: simple Query types (TermQuery, PhraseQuery) just
copy the value, while compound query types (BooleanQuery) multiply
their own boost into the boosts of their sub-queries. Here's a snip
from BooleanQuery.pm:

    # iterate over the clauses, creating a Weight for each one
    my $boost = $self->get_boost;
    my @sub_weights;
    for my $clause ( @{ $self->get_parent->get_clauses->to_perl } ) {
        my $sub_query  = $clause->get_query;
        my $sub_boost  = $boost * $sub_query->get_boost;
        my $sub_weight = $sub_query->make_weight(
            searchable => $searchable,
            boost      => $sub_boost,
        );
        push @sub_weights, $sub_weight;
    }
    $sub_weights{$$self} = \@sub_weights;
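
To see the propagation rule in isolation, here is a toy version that
runs without KinoSearch. Nested hashes stand in for Query and Weight
objects, and this make_weight is a hypothetical free function, not the
real method; only the multiply-down behavior is taken from the snippet
above.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: a "query" is a hash with a name, a boost and,
# for compound queries, a list of clauses.  A "weight" is a hash holding
# the effective boost plus any sub-weights.
sub make_weight {
    my ( $query, $boost ) = @_;

    # The top-level call passes no boost, so start from the query's own.
    $boost = $query->{boost} unless defined $boost;

    my %weight = ( name => $query->{name}, boost => $boost );
    if ( $query->{clauses} ) {
        # Compound query: multiply our boost into each sub-query's boost.
        $weight{sub_weights} = [
            map { make_weight( $_, $boost * $_->{boost} ) }
                @{ $query->{clauses} }
        ];
    }
    return \%weight;
}

my $bool_query = {
    name    => 'bool',
    boost   => 2,
    clauses => [
        { name => 'foo', boost => 1.5 },
        { name => 'bar', boost => 1 },
    ],
};

my $weight = make_weight($bool_query);
print "$_->{name}: $_->{boost}\n" for @{ $weight->{sub_weights} };
```

With a top-level boost of 2, the two sub-weights end up with boosts of
3 and 2.
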

What's left to refactor is to divide the remaining methods into two
tasks: calculate a raw value, and normalize.

In the end, we'll have something like this:

    sub get_value {
        my $self  = shift;
        my $value = $self->get_raw_value;
        $value *= $self->get_boost;
        $value *= $self->get_norm_factor;
        return $value;
    }

Methods like sum_of_squared_weights, etc., have esoteric meanings
related to cosine similarity measures and other IR theory. It might
be kind of hard to write them up if you aren't up-to-speed on the
relevant topics. Also, the architecture inherited from Lucene was a
spaghettified mess -- the code is hard to follow. While it would be
cool to see writeups, *I* have a hard time with this part of the code
base -- a lot was cargo-culted then verified only by comparing KS
scores against Lucene scores.

Lemme finish revising Weight *before* anybody writes it up. Then I
look forward to making a second leap forward after some feedback.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
|
|
| Re: Re: Wildcards |

|
2008-02-14 22:36:21 |
On Feb 13, 2008, at 9:03 PM, Marvin Humphrey wrote:
>
> On Feb 13, 2008, at 8:26 PM, Nathan Kurz wrote:
>
>> Father C (and lurkers), I think it would be great if you could write
>> up your overview as well. Even if you haven't poked around all the
>> innards in depth, you're much closer to the way it works than most
>> users will ever be. So without reference to how it actually works,
>> write up something describing how it should work.

I’m actually quite clueless in this regard. Yes, I’ve looked at the
code (at least what’s in Perl), but I still don’t know what’s going
on. As for how it should work, I have no idea...just as long as I can
use it.

> Quickly...
>
> I'm pretty close to a coherent API for Weight.
>
> [... blah blah blah...]
>
> What's left to refactor ....

Could you include a way for a set of terms to be treated as a single
term with regard to scoring, i.e., as if ‘fool’ and ‘food’ (in a
wildcard foo* match, for instance) were simply stored as ‘foo’ in the
index (the way word stemming works)? (If I’m not making myself clear,
please let me know.) If you don’t want to include this in core
KinoSearch, could you at least bear this in mind? This would, I
believe, affect the way doc_freq is calculated.
|
|
| Re: Re: Wildcards |
  United States |
2008-02-15 15:16:44 |
On Feb 14, 2008, at 8:36 PM, Father Chrysostomos wrote:
> Could you include a way for a set of terms to be treated as a
> single term with regard to scoring, i.e., as if ‘fool’ and
> ‘food’ (in a wildcard foo* match, for instance) were simply stored
> as ‘foo’ in the index (the way word stemming works)?

The way the index is laid out, each term gets its own posting list
with its own set of ascending document numbers. Scorers have to
iterate through document numbers in ascending order. The only way to
combine multiple posting lists is to interleave the doc num sets.

The only search-time options are 1) run through each set and build up
a superset before the Scorer starts iterating, or 2) put multiple
PostingList objects into a priority queue sorted by ascending doc num.

Another approach is to break all terms into all possible substrings at
index time and store them in a separate "substrings" field. The size
of the index will explode, but then "foo*" becomes a simple term query
for "foo".
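
To get a feel for how quickly that grows, here is a small sketch that
lists the distinct substrings of one term. The all_substrings helper is
hypothetical; how the results would actually be fed into a "substrings"
field is left open.

```perl
use strict;
use warnings;

# Return every distinct substring of a term, sorted.
# A term of length n has at most n * (n + 1) / 2 of them.
sub all_substrings {
    my ($term) = @_;
    my %seen;
    my $len = length $term;
    for my $start ( 0 .. $len - 1 ) {
        for my $size ( 1 .. $len - $start ) {
            $seen{ substr( $term, $start, $size ) } = 1;
        }
    }
    return sort keys %seen;
}

my @subs = all_substrings('food');
print scalar(@subs), " substrings: @subs\n";
```

For "food" that is 9 distinct substrings for a single 4-letter term,
"foo" among them, so a plain term query for "foo" would find the
document.
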

> (If I’m not making myself clear, please let me know.) If you don’t
> want to include this in core KinoSearch, could you at least bear
> this in mind? This would, I believe, affect the way doc_freq is
> calculated.

Yes, doc_freq is a difficult problem to solve with wildcards.

It's particularly hard when you get to dealing with several indexes
across multiple machines.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
|
|
|
|