Thread: Re: subclassing term scorers




Re: subclassing term scorers
2007-07-18 21:35:25
On Jul 18, 2007, at 6:27 PM, Nathan Kurz wrote:

> I've rearranged my responses to emphasize our agreement.

Well done! ;)

>> Lastly, Posting and PostingList also happen to align well with IR
>> theory, making for what seems to me a more coherent conceptual OO
>> model than Lucene's TermDocs/TermPositions.
>
> Yes, I agree.  I'm not intimately familiar with Lucene's model apart
> from via yours, but Posting and PostingList inherently make sense.

In the TermDocs/TermPositions model, the traits are added to the
iterator itself.

    // TermDocs: doc number and frequency are read off the iterator.
    while (termDocs.next()) {
      System.out.println("DOC: "  + termDocs.doc());
      System.out.println("FREQ: " + termDocs.freq());
    }

    // TermPositions: positions (and payloads) are pulled one at a time.
    while (termPositions.next()) {
      System.out.println("DOC: "  + termPositions.doc());
      int freq = termPositions.freq();
      System.out.println("FREQ: " + freq);
      while (freq-- > 0) {
        int position = termPositions.nextPosition();
        System.out.println("POS: " + position);
        if (termPositions.isPayloadAvailable()) {
          byte[] payload = termPositions.getPayload(null, 0);
          printPayloadSomeHow(payload);
        }
      }
    }

There isn't an object which represents a posting.

Another significant difference is that Lucene iterates over positions
one at a time via nextPosition(), while KS loads them all into memory
at once.
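
For contrast, here is a minimal Java sketch of what a Posting/PostingList
style of iteration might look like. It is not KinoSearch's actual API (KS
is written in C and Perl); the class names Posting, PostingList and
PostingListDemo, their fields, and the in-memory list standing in for the
postings file are all assumptions made for illustration. The point is only
that the iterator hands back an object representing one posting, with all
of its positions already loaded.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical value object: one posting is one document's entry for a term.
final class Posting {
    final int doc;
    final int[] positions;   // every position for this doc, loaded at once

    Posting(int doc, int[] positions) {
        this.doc = doc;
        this.positions = positions;
    }

    // Frequency is derived from the loaded positions.
    int freq() { return positions.length; }
}

// Hypothetical iterator that yields Posting objects rather than exposing
// doc() and freq() on the iterator itself.
final class PostingList implements Iterator<Posting> {
    private final Iterator<Posting> source;

    PostingList(List<Posting> postings) { this.source = postings.iterator(); }

    @Override public boolean hasNext() { return source.hasNext(); }
    @Override public Posting next() { return source.next(); }
}

final class PostingListDemo {
    // Formats each posting on one line: doc number, frequency, positions.
    static String describe(PostingList list) {
        StringBuilder out = new StringBuilder();
        while (list.hasNext()) {
            Posting p = list.next();
            out.append("DOC: ").append(p.doc)
               .append(" FREQ: ").append(p.freq())
               .append(" POS:");
            for (int pos : p.positions) out.append(' ').append(pos);
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PostingList list = new PostingList(Arrays.asList(
            new Posting(3, new int[] {1, 7}),
            new Posting(9, new int[] {4})));
        System.out.print(describe(list));
    }
}
```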

>>    * a write method
>>    * a read method
>>    * a make_scorer method
>>    * a TermScorer subclass that overrides Scorer_Tally
>
> Here's where we separate a little.  I'd like to make it even simpler,
> and require only that it define a read method (and presumably a write
> method, although I've thought very little about that side).

Yes, you could do that.  Presumably, the subclass would interpret the
same postings file data differently somehow from the parent class.

> A new scorer could be defined to make use of new information in new
> Posting, but this would be optional.

You're right.  In general that would work, provided that the subclass
was serious about fulfilling the parent class's interface.

> A subclassed Posting can continue to use
> the Scorer used by its parent.  Thus if ScorePosting is a
> descendant of MatchPosting, MatchPostingScorer can call
> ScorePosting->read() and end up with a Posting it can handle.

I can't think of a reason why this wouldn't work.  Boilerplater
implements single inheritance only, a very limited OO model.  There's
a little trickiness in there -- RichPosting's file format doesn't
"inherit" from ScorePosting's, for instance...

   <doc, freq, shared_boost, <position>+>+
   <doc, freq, <position, boost>+>+

... and the generated posting->impact would presumably differ (that's
the whole point of RichPosting after all).  But the C structs would
be compatible.

>> The intent is that each Posting subclass will have a fixed
>> association with a corresponding TermScorer subclass.  You're not
>> supposed to be able to override that association without additional
>> subclassing.
>
> This I don't like.   I can see how you got here, but I think there is
> a better solution: the TermScorers depend only on the format of the
> Posting struct, and Posting->read() is the sole point of conversion
> from Index as file to Posting as object.

Well put.  You've persuaded me.
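
To make the agreed design concrete, here is a rough Java sketch (the real
code is C under Boilerplater, so this is only an analogy). The class names
MatchPosting, ScorePosting and MatchPostingScorer come from the discussion
above; everything else is assumed for illustration: the field names, the
fixed-width record layout, and the PostingReadDemo helper. In particular,
the byte layout is invented and is not KinoSearch's postings format. The
sketch shows a scorer written against the parent's fields only, with
read() as the single place where file bytes become a Posting.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Base posting: only records which document matched.
class MatchPosting {
    int doc;

    // Sole point of conversion from file bytes to in-memory fields.
    void read(DataInputStream in) throws IOException {
        doc = in.readInt();
    }
}

// Subclass: decodes a richer (invented) record but still fills in `doc`,
// so it honours the parent's interface.
class ScorePosting extends MatchPosting {
    int freq;
    float boost;

    @Override
    void read(DataInputStream in) throws IOException {
        doc = in.readInt();
        freq = in.readInt();
        boost = in.readFloat();
    }
}

// Scorer that depends only on MatchPosting's fields, never on the file format.
final class MatchPostingScorer {
    static int nextDoc(MatchPosting posting, DataInputStream in) throws IOException {
        posting.read(in);   // dispatches to the subclass's read()
        return posting.doc;
    }
}

final class PostingReadDemo {
    // Writes one record in the invented ScorePosting layout.
    static byte[] encodeScoreRecord(int doc, int freq, float boost) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(doc);
            out.writeInt(freq);
            out.writeFloat(boost);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Feeds a record to the parent-level scorer using whatever posting is supplied.
    static int readDocViaMatchScorer(MatchPosting posting, byte[] record) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            return MatchPostingScorer.nextDoc(posting, in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ScorePosting posting = new ScorePosting();
        int doc = readDocViaMatchScorer(posting, encodeScoreRecord(42, 3, 1.5f));
        System.out.println("doc=" + doc + " freq=" + posting.freq);
    }
}
```

The scorer never needs to know which subclass it was handed; a scorer
that wants freq or boost can be added later without touching this one.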

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearchrectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch


Re: subclassing term scorers
2007-07-18 23:06:13
On 7/18/07, Marvin Humphrey <marvinrectangular.com> wrote:
>
> Well put.  You've persuaded me.
>

Wonderful!  I'm sure there are details left, but I think we can safely
wait until later to discuss them, and concentrate on positions.

I said I'd send you more thoughts tonight, but I'm not going to have
time.  Things came up (happy things:  an opportunity to go out tuna
fishing leaving tomorrow morning at 2:30 am) so I'm not going to get
to it until Friday.

The gist of my thoughts was that:

1) I think we can get away with a single flat array of positions
rather than a complex structure.

2) All the Booleans should be able to pass positions by reference,
so no memory troubles foreseen there.

3) PhraseScorer and the like end up reducing the number of positions
so shouldn't be a problem.

4) I haven't been able to think of any geometric space issues.

5) Geometric time:  that could be a problem, which is why I'm
concerned with avoiding unnecessary work on unused branches.
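
Points 1 through 3 can be illustrated with a small Java sketch. This is
only an assumption about how such a scheme could look, not code from
KinoSearch: the FlatPositions class and its phraseStarts method are
hypothetical names, and positions are assumed to be sorted ascending with
no duplicates. Each term's positions live in one flat int array, the
arrays are passed by reference and never copied, and the two-term phrase
match returns at most as many positions as the shorter input.

```java
import java.util.Arrays;

final class FlatPositions {
    // Returns the positions in `first` that are immediately followed by a
    // position in `second`, i.e. the start positions of the two-term phrase.
    // Both inputs must be sorted ascending; they are read but never copied.
    static int[] phraseStarts(int[] first, int[] second) {
        int[] out = new int[Math.min(first.length, second.length)];
        int i = 0, j = 0, n = 0;
        while (i < first.length && j < second.length) {
            int wanted = first[i] + 1;      // where the second term must sit
            if (second[j] < wanted) {
                j++;                        // second term is too early
            } else if (second[j] > wanted) {
                i++;                        // no follower for this start
            } else {
                out[n++] = first[i];        // adjacent pair found
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);       // trim to the matches found
    }

    public static void main(String[] args) {
        int[] starts = phraseStarts(new int[] {2, 10, 17}, new int[] {3, 5, 18});
        System.out.println(Arrays.toString(starts));
    }
}
```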

I think your test case is a good one to think about.

Possibly disappearing until Friday,

Nathan Kurz
nateverse.com


