|
List Info
Thread: Re: RangeFilter false positives
|
|
| Re: RangeFilter false positives |
  United States |
2007-04-13 01:42:33 |
At 18:47 -0700 2007.04.12, Marvin Humphrey wrote:
>It turns out your fix was "wrong" in principle
but induced correct
>behavior -- it detoured around some bogus code rather
than
>eliminating the bogus code.
Yeah, I figured. But I figured it also might help you
narrow it all down.
Even when I saw the first commit come through, I thought a
fix for it was
probably coming next.
>Would you be interested in adding a new feature to
RangeFilter's
>API? If you can successfully modify the new code and
feel confident
>that you got it right, that would indicate that the
refactoring went
>well.
>
>Right now, all five constructor arguments are required.
However, we
>should have an omitted or undef value for lower_term
indicate "no
>lower bound" and an omitted or undef value for
upper_term indicate
>"no upper bound".
>
>Similarly, we should enable default false values for
include_lower
>and include_upper.
Sure ... I am just not entirely sure what the resulting
value should be for
"no lower bound" and "no upper bound."
Should no lower be -2? 0? And
should no upper be ... num_docs?
Or should this instead be smarter and skip those bounds
checks? For
example, return "-3" to mean no bound check,
then:
// no upper or lower bound
if (data->lower_bound == -3 &&
data->upper_bound == -3) {
data->inner_coll->collect(data->inner_coll,
doc_num, score);
}
// no lower bound
else if (data->lower_bound == -3 && locus
<= data->upper_bound) {
data->inner_coll->collect(data->inner_coll,
doc_num, score);
}
// no upper bound
else if (data->upper_bound == -3 && locus
>= data->lower_bound) {
data->inner_coll->collect(data->inner_coll,
doc_num, score);
}
else if (locus >= data->lower_bound &&
locus <= data->upper_bound) {
data->inner_coll->collect(data->inner_coll,
doc_num, score);
}
>Subversion trunk now tests clean under valgrind and is
safe to use
>again.
Awesome.
At 18:52 -0700 2007.04.12, Marvin Humphrey wrote:
>On Apr 12, 2007, at 6:47 PM, Marvin Humphrey wrote:
>> Similarly, we should enable default false values
for include_lower
>> and include_upper.
>
>Um, make that *true* values. Having "from A to
Z" include A and Z by
>default is clearly the most intuitive behavior.
Nod. I am sometimes a poor judge of what is most intuitive
to others, but
that is what I would expect.
--
Chris Nandor pudge pobox.com
http://pudge.net/
Open Source Technology Group pudge ostg.com
http://ostg.com/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: RangeFilter false positives |
  United States |
2007-04-13 02:09:10 |
Oh, and here are the patches I have for this. I am not
crazy about the -3
thing, but it seems to work, so I included that until I get
guidance from
you.
Also, a RangeFilter with no bounds at all seems silly, so
not sure if you
want to return an error, or let the caller just make a range
filter that
includes everything. For now, such a "silly"
filter is explicitly allowed.
--
Chris Nandor pudge pobox.com
http://pudge.net/
Open Source Technology Group pudge ostg.com
http://ostg.com/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
|
| Re: RangeFilter false positives |
  United States |
2007-04-13 02:14:47 |
Argh. Small error in test. New patch. Bedtime now.
--
Chris Nandor pudge pobox.com
http://pudge.net/
Open Source Technology Group pudge ostg.com
http://ostg.com/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
|
| Re: RangeFilter false positives |
  United States |
2007-04-13 02:17:35 |
On Apr 12, 2007, at 11:42 PM, Chris Nandor wrote:
> Sure ... I am just not entirely sure what the resulting
value
> should be for
> "no lower bound" and "no upper
bound." Should no lower be -2? 0?
> And
> should no upper be ... num_docs?
No lower should probably be 0.
No upper should be Reader->max_doc.
The range filter uses an IntMap object, which is essentially
just an
array of i32_t. Array index corresponds to doc num and
value
corresponds to "term number".
Term number is the would-be array index of the term if all
the terms
in that field were marshalled into an array. Docs without a
term in
that field are assigned -1 by default.
-1 is the lowest possible value that can appear in the
array;
therefore assigning -1 or less to lower_bound means that no
document
will be excluded by the lower bound test, even those with no
value
for the field. Also, assigning -2 to the upper_bound means
that no
hits will be collected.
Reader->max_doc works out to 1 greater than the highest
possible term
number. RangeFilter can only be used against an unanalyzed
field,
which means we have at most one unique term per document.
Reader-
>max_doc is 1 greater than the maximum number of
documents;
therefore, it is also 1 greater than the maximum number of
terms in
an unanalyzed field.
>
> Or should this instead be smarter and skip those bounds
checks? For
> example, return "-3" to mean no bound check,
then:
>
> // no upper or lower bound
> if (data->lower_bound == -3 &&
data->upper_bound == -3) {
>
data->inner_coll->collect(data->inner_coll,
doc_num, score);
> }
> // no lower bound
> else if (data->lower_bound == -3 &&
locus <= data->upper_bound) {
>
data->inner_coll->collect(data->inner_coll,
doc_num, score);
> }
> // no upper bound
> else if (data->upper_bound == -3 &&
locus >= data->lower_bound) {
>
data->inner_coll->collect(data->inner_coll,
doc_num, score);
> }
> else if (locus >= data->lower_bound
&& locus <= data-
> >upper_bound) {
>
data->inner_coll->collect(data->inner_coll,
doc_num, score);
> }
This is an implementation detail, so it's not something I'm
gonna
sweat too hard. However, my mild preference is to resolve
this in
RangeFilter.pm and keep the HitCollector code, which is a
performance-
critical inner loop, as simple as possible. These
conditionals
aren't going to make a difference, but as a general rule, I
try to
keep the inner loop stuff uncluttered.
Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: RangeFilter false positives |
  United States |
2007-04-13 02:21:46 |
OK, new patch coming soon, ignore the others.
--
Chris Nandor pudge pobox.com
http://pudge.net/
Open Source Technology Group pudge ostg.com
http://ostg.com/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
[1-5]
|
|