List Info

Thread: 9476: trunk/xapian-core/ trunk/xapian-core/include/xapian/ trunk




9476: trunk/xapian-core/ trunk/xapian-core/inc lude/xapian/ trunk
country flaguser name
United Kingdom
2007-10-19 05:20:49
olly wrote:
> SVN root:       svn://svn.xapian.org/xapian
> Changes by:     olly
> Revision:       9476
> Date:           2007-10-19 03:47:11 +0100 (Fri, 19 Oct
2007)
> 
> Log message (14 lines):
>
include/xapian/queryparser.h,queryparser/queryparser.cc,
>
queryparser/queryparser.lemony,queryparser/queryparser_inter
nal.h,
> tests/queryparsertest.cc: Since calling
QueryParser::add_prefix()
> or QueryParser::add_boolean_prefix() a second time with
the same
> field name was ignored before (rather than overriding
as we had
> thought) it seems reasonable to change this behaviour. 
This
> also avoids the need to deprecate these methods which
will force all
> users to update their code.

I think leaving the old methods in place and undeprecated
with these 
semantics is fine, but I'd argue the the new, 3 argument,
form of 
add_prefix, and the prefix_type enum are good additions
because they 
lead to future improvements (although, they're possibly not
needed yet).

I thought that the consensus on the query parser was that it
should 
become more configurable, and one of the ways in which it
should be more 
configurable is being able to generate terms according to
the prefixes 
used in a more flexible way.  In particular, it should be
possible to 
decouple the status of a prefix as a "filter"
prefix from the way in 
which terms are generated from the text.


Currently, we generate terms in two ways:

1. By taking the text after the field specifier, and
processing it as 
free text (stemming, lowercase conversion, end at
punctuation or space, 
may generate phrases).

2. By taking the text after the field specifier, and
processing it as a 
single unit (currently, lowercase conversion only, end at
space, no 
phrases).


We also use these terms in two ways:

A. Involving the term in the query (so, joining it with
surrounding 
terms according to boolean operators, and using the default
operator if 
there aren't any such operators).

B. Adding the term to a list of filter terms, which are then
combined 
with the rest of the query to restrict the set of matching
documents, 
but not used to affect rankings.  Surrounding boolean
operators are 
ignored or invalid (can't remember which).

Currently, add_prefix(field, prefix) allows the combination
of "1" and 
"A" above to be used, and
add_boolean_prefix(field, prefix) allows the 
combination of "2" and "B" above to be
used.

There is really no reason that options 1B and 2A shouldn't
be available: 
for example, option 2A on field "code" would allow
something like

  "codekcd.2
354/sad OR richard"

to generate the query "XCODEkcd.2
354/sad OR richard"; ie, be used as a 
normal weighted term, and included in boolean operators.


My idea when introducing the prefix_type enum, and when
naming its 
values as PREFIX_INLINE and PREFIX_FILTER, was that the
prefix_type 
would determine how the term was "mixed" into the
query (so, 
PREFIX_INLINE correlated to option "A" above, and
PREFIX_FILTER 
correlated to option "B" above).  My patch also
hard-coded the 
combination of option A with "1" and option B with
"2", but my thinking 
was that we could later add further parameters to
add_prefix() 
specifying how the term generation should be done. 
Probably, this would 
simply involve adding a single extra parameter holding a
"TermProcessor" 
class, which would generate a term, (or possibly a query),
from a piece 
of text.  There would be default TermProcessor classes which
followed 
option "1" or option "2" above, and the
user would be free to define 
further such classes.


Now, it's reasonable to argue, since you're happy with
changing the 
semantics of add_prefix and add_boolean_prefix, that we
don't need the 
new form of add_prefix() just yet, since it doesn't add any
new 
functionality; there's only any point in adding the new form
of 
add_prefix() once we have an implementation which takes 4
arguments, 
allowing "TermProcessors" to be added.  I'm not
too bothered about this 
right now, but it does seem that add_boolean_prefix() will
then become a 
limited form of add_prefix() and would eventually be
deprecated to clean 
up the API, so adding a 3 argument form of add_prefix() now
to allow 
users to start using it instead of add_boolean_prefix()
doesn't seem 
unreasonable either.

The main concern I have is to check that we're still heading
in the same 
direction: making the query parser more configurable, and
decoupling the 
status of a prefix as an inline or a filter prefix from the
way in which 
term processing is done.


However, I would say that changing from using an enum
internally to 
using a boolean value for the prefix type seems like a
backwards step - 
use of an enum makes the code more readable, in my opinion. 
I'd be 
happier if the enum was still used internally (though not
exposing it in 
the API until we have a 3 or 4 argument form of add_prefix()
is 
perfectly sensible).

-- 
Richard

_______________________________________________
Xapian-devel mailing list
Xapian-devellists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel

Re: 9476: trunk/xapian-core/ trunk/xapian-core/include/xapian/ t
country flaguser name
United Kingdom
2007-10-19 08:53:25
On Fri, Oct 19, 2007 at 11:20:49AM +0100, Richard Boulton
wrote:
> olly wrote:
> >Since calling QueryParser::add_prefix()
> >or QueryParser::add_boolean_prefix() a second time
with the same
> >field name was ignored before (rather than
overriding as we had
> >thought) it seems reasonable to change this
behaviour.  This
> >also avoids the need to deprecate these methods
which will force all
> >users to update their code.
> 
> I think leaving the old methods in place and
undeprecated with these 
> semantics is fine, but I'd argue the the new, 3
argument, form of 
> add_prefix, and the prefix_type enum are good additions
because they 
> lead to future improvements (although, they're possibly
not needed yet).

The API as it is after this change is the original one you
committed
(with one small semantic change).  You only changed it
because I pointed
out that the behaviour it changed might actually be being
used.  But I
had thought that setting the same field to a different
prefix changed
it, whereas actually it's ignored, so the whole premise for
changing it
was flawed.

I tried to bring this up on IRC, but got no response, so I
went ahead
as it didn't seem controversial (it was your original API
after all),
and SVN HEAD is currently much too far from being in a
releasable state,
which I want to address.  Ideally we should be able to make
a release at
short notice if we need to.

> 2. By taking the text after the field specifier, and
processing it as a 
> single unit (currently, lowercase conversion only, end
at space, no 
> phrases).

FWIW, it also ends at a close bracket to avoid swallowing
them when used at
the end of a bracketted subexpression.

> [...]  I'm not too bothered about this 
> right now, but it does seem that add_boolean_prefix()
will then become a 
> limited form of add_prefix() and would eventually be
deprecated to clean 
> up the API, so adding a 3 argument form of add_prefix()
now to allow 
> users to start using it instead of add_boolean_prefix()
doesn't seem 
> unreasonable either.

I think it's better not to try to predict the API we'll want
here before
we actually try to implement it.  We may get it wrong and
force users to
update their add_prefix() calls multiple times.  Deprecating
a method in
favour of another has a cost for everyone using it, so it's
not
something we should do too lightly.

> The main concern I have is to check that we're still
heading in the same 
> direction: making the query parser more configurable,
and decoupling the 
> status of a prefix as an inline or a filter prefix from
the way in which 
> term processing is done.

I'm not sure I'd use the word decouple, as there are
essential
differences between the two (for example, it doesn't make
sense to
have the empty field prefix being a filter, nor to have a
phrase of
filters).  But certainly this should be much more
configurable.

> However, I would say that changing from using an enum
internally to 
> using a boolean value for the prefix type seems like a
backwards step - 
> use of an enum makes the code more readable, in my
opinion.

FWIW, I find it slightly more readable as currently used -
the choice is
essentially:

    foo->filter

versus:

    foo->type == TYPE_FILTER

It also avoids the need to Assert(foo->type ==
TYPE_INLINE); after
dealing with the boolean filter case.

But the main reason I changed it was that I find
"TYPE_INLINE" a rather
confusing name, and couldn't think of a better one.  To me,
TYPE_INLINE
suggests a boolean filter inline in the query string
(constrasted with
a boolean filter outside the query (e.g. in an HTML
SELECT).

Cheers,
    Olly

_______________________________________________
Xapian-devel mailing list
Xapian-devellists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel

[1-2]

about | contact  Other archives ( Real Estate discussion Medical topics )