Thread: Re: Wildcards




Re: Wildcards
2008-02-08 05:59:14
Father Chrysostomos:

> I suppose the answers to these questions are precisely what you are
> working on.

Please take a look at the newly committed
KinoSearch::Docs::Cookbook::WildcardQuery and let me know how it goes:
<http://xrl.us/bfust>

(We'll have to expose Scorer and Tally as public classes, plus all the
methods overridden in the cookbook examples.)

> Anyway, the attached patch shows what I've been trying to do so far
> (completely untested).

I didn't see this because my email machine just crashed, I've had to
restore from backup, and the list archive didn't preserve it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch


Re: Re: Wildcards
2008-02-08 10:45:40
On Feb 8, 2008, at 3:59 AM, Marvin Humphrey wrote:

> Father Chrysostomos:
>
>> I suppose the answers to these questions are precisely what you are
>> working on.
>
> Please take a look at the newly committed
> KinoSearch::Docs::Cookbook::WildcardQuery and let me know how it goes:
> <http://xrl.us/bfust>

Thank you. That’s very helpful.

> +        $tally{$id}    = KinoSearch::Search::Tally->new;
> +        $tally{$id}->set_score(1.0);    # fixed score of 1.0

One question: Is this the place where the weight’s value should be
specified? (I.e., ->set_score($weight->get_value) )

>> Anyway, the attached patch shows what I've been trying to do so far
>> (completely untested).
>
> I didn't see this because my email machine just crashed, I've had to
> restore from backup, and the list archive didn't preserve it.

It’s probably not much use to you now, but here it is anyway:





Father Chrysostomos

P.S.: I think there’s something else wrong with your e-mail program.
It’s outputting a Bcc header, which clearly shouldn’t be there.


Re: Re: Wildcards
2008-02-08 12:10:55
On 2/8/08, Marvin Humphrey <marvin@rectangular.com> wrote:
> Please take a look at the newly committed
> KinoSearch::Docs::Cookbook::WildcardQuery and let me know how it goes:
> <http://xrl.us/bfust>

Seems like a wonderful sort of example to have.  I didn't read through
it closely, but saw a small typo: the abstract refers to 'leading
wildcards' instead of 'trailing wildcards'.

--nate



Re: Re: Wildcards
2008-02-11 20:01:20
On Feb 8, 2008, at 8:45 AM, Father Chrysostomos wrote:
>
>> +        $tally{$id}    = KinoSearch::Search::Tally->new;
>> +        $tally{$id}->set_score(1.0);    # fixed score of 1.0
>
> One question: Is this the place where the weight’s value should be
> specified? (I.e., ->set_score($weight->get_value) )

It would be, except that the example code never even bothers to
calculate such a value.

I wrote up the tutorial in the spirit of designing a public API from
first principles (and in fact, made some changes to the API while
writing it up).  I think what's in those docs is pretty coherent, and
manages to fulfill the high-level requirements.  The Query-Weight-Scorer
hierarchy is here to stay, methinks, and that's what I wanted to cover.

The stuff that got left out, though, is a lot less coherent right now.
It's spread out over Weight, Similarity, and Posting, and I'm not sure
exactly how it should be refactored.

As a starting point... Weight should probably have a single crucial
method which takes a Searchable as its main argument:

   sub compute {
     my ( $self, $searchable ) = @_;
     # seed this Weight's value from the parent Query's boost
     $value{$$self} = $self->get_parent->get_boost;
   }

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Re: Wildcards
2008-02-11 22:41:20
On Feb 11, 2008, at 6:01 PM, Marvin Humphrey wrote:

> The stuff that got left out, though, is a lot less coherent right
> now.  It's spread out over Weight, Similarity, and Posting, and I'm
> not sure exactly how it should be refactored.

If you could explain to me in a few words what each method does, and
who calls it, I might be able to do some brainstorming. Right now most
of it is still voodoo to me.


Father Chrysostomos



Re: Re: Wildcards
2008-02-12 18:38:06
On Feb 11, 2008, at 8:41 PM, Father Chrysostomos wrote:

> If you could explain to me in a few words what each method does, and
> who calls it, I might be able to do some brainstorming. Right now
> most of it is still voodoo to me.

I'm working on writing up a coherent explanation.

So far, I'd be happy to tell you about Similarity::query_norm.  It
used to do something not very useful, as explained here:

   http://xrl.us/bf4q3
(Link to mail-archives.apache.org)

Now it's gone.  I zapped it.

More to come.

PS: Weight's docs have been improved a bit.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Re: Wildcards
2008-02-13 22:26:59
On 2/11/08, Marvin Humphrey <marvin@rectangular.com> wrote:
> I think what's in those docs is pretty coherent, and
> manages to fulfill the high-level requirements.  The
> Query-Weight-Scorer hierarchy is here to stay, methinks, and that's
> what I wanted to cover.

Hey Marvin ---

I think you are going along a good path here, and that fleshing out a
couple of these examples would be very useful, both as tutorials and
to determine if parts of the API need refinement.  For a simple (and
self-serving) example that would make me happy, I'd love to see a
setup that scores Boolean queries purely by the weights given by the
user: ORs return the highest subscore, ANDs multiply their subscores.
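
A rough sketch of that scoring rule in plain Perl, for illustration
only. Nothing here is KinoSearch API: the hash-based query tree, the
per-term doc sets, and the score() function are all assumptions made
up for the example.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical query tree (not KinoSearch objects): term nodes carry a
# user-assigned weight and the set of docs they match; 'or' and 'and'
# nodes carry child nodes.
sub score {
    my ( $node, $doc ) = @_;
    if ( $node->{type} eq 'term' ) {
        return $node->{docs}{$doc} ? $node->{weight} : 0;
    }
    my @sub = map { score( $_, $doc ) } @{ $node->{children} };
    if ( $node->{type} eq 'or' ) {
        return max( 0, @sub );    # the best-matching clause wins
    }

    # 'and': multiply the subscores; any non-matching clause (score 0)
    # zeroes the whole product
    my $product = 1;
    $product *= $_ for @sub;
    return @sub ? $product : 0;
}

my $query = {
    type     => 'and',
    children => [
        { type => 'term', weight => 0.5, docs => { 1 => 1, 2 => 1 } },
        {   type     => 'or',
            children => [
                { type => 'term', weight => 0.2, docs => { 1 => 1 } },
                { type => 'term', weight => 0.8, docs => { 1 => 1, 3 => 1 } },
            ],
        },
    ],
};

printf "doc %d: %.2f\n", $_, score( $query, $_ ) for 1 .. 3;
```

Only doc 1 satisfies both AND clauses, so it scores 0.5 * max(0.2, 0.8)
and the other two docs score zero.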

As to the fitness of the Query-Weight-Scorer system, you're probably
right to want to keep it.  But I think you can do a lot to tidy it up
and make it more comprehensible to newcomers.  Instead of writing a
detailed description of each method on each object, I think you'd get
more benefit out of a solid high level overview.  I'm a firm believer
that an easy to explain architecture will be both easier to maintain
and easier to improve.

Father C (and lurkers), I think it would be great if you could write
up your overview as well.  Even if you haven't poked around all the
innards in depth, you're much closer to the way it works than most
users will ever be.  So without reference to how it actually works,
write up something describing how it should work.  I think that
understanding how potential users intuitively view a problem is a great
step toward providing them a solution.

Nathan Kurz
nateverse.com

ps.  Marvin, I apologize if I'm coming across as too much of an
armchair critic lately.  I'll send you some ice cream one of these
days.  While I haven't been writing much code, the ice cream's getting
to be really tasty.



Re: Re: Wildcards
2008-02-13 23:03:03
On Feb 13, 2008, at 8:26 PM, Nathan Kurz wrote:

> Father C (and lurkers), I think it would be great if you could write
> up your overview as well.  Even if you haven't poked around all the
> innards in depth, you're much closer to the way it works than most
> users will ever be.  So without reference to how it actually works,
> write up something describing how it should work.

Quickly...

I'm pretty close to a coherent API for Weight.

I just refactored so that boosts are dealt with only at construction
time.  They now propagate from Query to Weight in a very
straightforward way: simple Query types (TermQuery, PhraseQuery) just
copy the value, while compound query types (BooleanQuery) multiply
their own boost into the boosts of their sub-queries.  Here's a snip
from BooleanQuery.pm:

     # iterate over the clauses, creating a Weight for each one
     my $boost = $self->get_boost;
     my @sub_weights;
     for my $clause ( @{ $self->get_parent->get_clauses->to_perl } ) {
         my $sub_query  = $clause->get_query;
         my $sub_boost  = $boost * $sub_query->get_boost;
         my $sub_weight = $sub_query->make_weight(
             searchable => $searchable,
             boost      => $sub_boost,
         );
         push @sub_weights, $sub_weight;
     }
     $sub_weights{$$self} = \@sub_weights;
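
To make the propagation concrete, here is a toy version with numbers.
It is only a sketch under stated assumptions: the queries are plain
hashes rather than KinoSearch objects, and the make_weight() function,
the label/boost/clauses keys, and the sub_weights key are invented for
the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Query->make_weight, using plain hashes.
# A top-level query copies its own boost; a compound query multiplies
# its boost into each sub-query's boost before building the sub-weight.
sub make_weight {
    my ( $query, $boost ) = @_;
    $boost = $query->{boost} unless defined $boost;
    my @sub_weights;
    for my $sub_query ( @{ $query->{clauses} || [] } ) {
        push @sub_weights,
            make_weight( $sub_query, $boost * $sub_query->{boost} );
    }
    return {
        label       => $query->{label},
        boost       => $boost,
        sub_weights => \@sub_weights,
    };
}

my $query = {
    label   => 'bool',
    boost   => 2,
    clauses => [
        { label => 'foo', boost => 3 },
        { label => 'bar', boost => 0.5 },
    ],
};

my $weight = make_weight($query);
for my $w ( $weight, @{ $weight->{sub_weights} } ) {
    print "$w->{label}: $w->{boost}\n";
}
```

With a compound boost of 2, the 'foo' clause ends up with 2 * 3 = 6 and
the 'bar' clause with 2 * 0.5 = 1.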

What's left to refactor is to divide the remaining methods into two
tasks: calculate a raw value, and normalize.

In the end, we'll have something like this:

    sub get_value {
        my $self = shift;
        my $value = $self->get_raw_value;
        $value *= $self->get_boost;
        $value *= $self->get_norm_factor;
        return $value;
    }

Methods like sum_of_squared_weights, etc., have esoteric meanings
related to cosine similarity measures and other IR theory.  It might
be kind of hard to write them up if you aren't up-to-speed on the
relevant topics.  Also, the architecture inherited from Lucene was a
spaghettified mess -- the code is hard to follow.  While it would be
cool to see writeups, *I* have a hard time with this part of the code
base -- a lot was cargo-culted then verified only by comparing KS
scores against Lucene scores.

Lemme finish revising Weight *before* anybody writes it up.  Then I
look forward to making a second leap forward after some feedback.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Re: Wildcards
2008-02-14 22:36:21
On Feb 13, 2008, at 9:03 PM, Marvin Humphrey wrote:

>
> On Feb 13, 2008, at 8:26 PM, Nathan Kurz wrote:
>
>> Father C (and lurkers), I think it would be great if you could write
>> up your overview as well.  Even if you haven't poked around all the
>> innards in depth, you're much closer to the way it works than most
>> users will ever be.  So without reference to how it actually works,
>> write up something describing how it should work.

I’m actually quite clueless in this regard. Yes, I’ve looked at the
code (at least what’s in Perl), but I still don’t know what’s going
on. As for how it should work, I have no idea... just as long as I can
use it.

> Quickly...
>
> I'm pretty close to a coherent API for Weight.
>
> [... blah blah blah...]
>
> What's left to refactor ....

Could you include a way for a set of terms to be treated as a single
term with regard to scoring, i.e., as if ‘fool’ and ‘food’ (in a
wildcard foo* match, for instance) were simply stored as ‘foo’ in the
index (the way word stemming works)? (If I’m not making myself clear,
please let me know.) If you don’t want to include this in core
KinoSearch, could you at least bear this in mind? This would, I
believe, affect the way doc_freq is calculated.
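
One way to picture the doc_freq consequence, as a sketch only: if the
expanded terms are treated as a single term, a document containing both
‘fool’ and ‘food’ is counted once, so the group's doc_freq is the size
of the union of the doc sets rather than the sum of the per-term
counts. The in-memory %postings hash and the group_doc_freq() function
are assumptions for the example, not KinoSearch structures.

```perl
use strict;
use warnings;

# Hypothetical in-memory postings: term => ascending list of doc numbers.
my %postings = (
    fool => [ 1, 4, 7 ],
    food => [ 2, 4, 9 ],
    bar  => [3],
);

# doc_freq of a group of terms treated as one term: each document is
# counted once, however many of the group's terms it contains.
sub group_doc_freq {
    my ( $postings, @terms ) = @_;
    my %seen;
    for my $term (@terms) {
        $seen{$_} = 1 for @{ $postings->{$term} || [] };
    }
    return scalar keys %seen;
}

# Expand the wildcard foo* against the known terms.
my @expanded = grep { /^foo/ } keys %postings;
my $doc_freq = group_doc_freq( \%postings, @expanded );
print "$doc_freq\n";    # 5: docs 1, 2, 4, 7, 9 (doc 4 is counted once)
```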




Re: Re: Wildcards
2008-02-15 15:16:44
On Feb 14, 2008, at 8:36 PM, Father Chrysostomos wrote:

>> Could you include a way for a set of terms to be treated as a
>> single term with regard to scoring, i.e., as if ‘fool’ and ‘food’
>> (in a wildcard foo* match, for instance) were simply stored as
>> ‘foo’ in the index (the way word stemming works)?

The way the index is laid out, each term gets its own posting list
with its own set of ascending document numbers.  Scorers have to
iterate through document numbers in ascending order.  The only way to
combine multiple posting lists is to interleave the doc num sets.
The only search-time options are 1) run through each set and build up
a superset before the Scorer starts iterating, or 2) put multiple
PostingList objects into a priority queue sorted by ascending doc num.

Another approach is to break all terms into all possible substrings at
index time and store them in a separate "substrings" field.  The size
of the index will explode, but then "foo*" becomes a simple term query
for "foo".
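
For a sense of the blow-up, here is a small sketch that enumerates the
distinct substrings of one term. The substrings() function is
hypothetical and only illustrates the index-time expansion; it is not
part of KinoSearch.

```perl
use strict;
use warnings;

# Every distinct contiguous substring of a term, as might be stored in
# a hypothetical "substrings" field at index time.
sub substrings {
    my ($term) = @_;
    my %seen;
    for my $start ( 0 .. length($term) - 1 ) {
        for my $len ( 1 .. length($term) - $start ) {
            $seen{ substr( $term, $start, $len ) } = 1;
        }
    }
    return sort keys %seen;
}

my @subs = substrings('food');
print scalar(@subs), ": @subs\n";    # 9: d f fo foo food o od oo ood
```

A term of length n has up to n(n+1)/2 substrings, which is where the
index growth comes from.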

> (If I’m not making myself clear, please let me know.) If you don’t
> want to include this in core KinoSearch, could you at least bear
> this in mind? This would, I believe, affect the way doc_freq is
> calculated.

Yes, doc_freq is a difficult problem to solve with wildcards.

It's particularly hard when you get to dealing with several indexes
across multiple machines.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



