Thread: word pair indexing and querying




word pair indexing and querying
user name
2006-09-21 11:24:01
Hi,

Is it possible, preferably using the simplistic scriptindex and
cgi-bin/omega approach, to create and query a database so that I can
force matches to occur only for word pairs?

For example I would want a match for "garden centre" but no match at
all, or perhaps just a low relevance match, for the query "garden" or
"centre".  Whereas my current approach using an indexscript with
something like the following:

name : truncate=100 field=caption boolean=name index

and a data file with:

name=garden centre

means that I get 100% relevance matches for any of "garden", "centre" or
"garden centre", which is rather unfortunate in my case.

Any thoughts/ideas/cunning plans would be appreciated.

Mark



________________________________________________________________________
This email has been scanned for all known viruses by the MessageLabs
SkyScan service.

_______________________________________________
Xapian-discuss mailing list
Xapian-discusslists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
word pair indexing and querying
user name
2006-09-21 12:01:31
On Thu, Sep 21, 2006 at 12:24:01PM +0100, Mark Hagger wrote:

> Is it possible, preferably using the simplistic scriptindex and
> cgi-bin/omega approach, to create and query a database so that I can
> force matches to only occur for word pairs.

You can set the default operator in omega to AND instead of OR. (Set
DEFAULTOP as an argument to the omega CGI.)
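
As a rough illustration of passing DEFAULTOP on the query string, here
is a small Python sketch that builds such a URL. The host name is a
placeholder, P is assumed to be the parameter carrying the query text,
and the value "and" is an assumption that should be checked against the
omega documentation for your version.

```python
from urllib.parse import urlencode

# Hypothetical host; replace with wherever your omega CGI is installed.
OMEGA_URL = "http://search.example.org/cgi-bin/omega"


def omega_query_url(query, default_op="and"):
    """Build an omega search URL with an explicit default operator."""
    # P carries the query text, DEFAULTOP the operator used between terms.
    params = {"P": query, "DEFAULTOP": default_op}
    return OMEGA_URL + "?" + urlencode(params)


print(omega_query_url("garden centre"))
```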

You can also use a phrase search, although this will be slower.

> For example I would want a match for "garden centre" but no match at
> all, or perhaps just a low relevance match, for the query "garden" or
> "centre".  Whereas my current approach using an indexscript with
> something like the following:
> 
> name : truncate=100 field=caption boolean=name index
> 
> and a data file with:
> 
> name=garden centre
> 
> means that I get 100% relevance matches for any of "garden", "centre" or
> "garden centre", which is rather unfortunate in my case.
> 
> Any thoughts/ideas/cunning plans would be appreciated.

Have you tried this with real data? Working with short test documents
often won't give you a realistic idea of what will actually happen.

Out of interest, why are you doing boolean=name? boolean=S would be
more usual, particularly if you want to use omega to search it.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  jamestartarus.org                                uncertaintydivision.org

word pair indexing and querying
user name
2006-09-21 13:17:40
Ah, no, I see why you suggested that, but on reflection perhaps I didn't
explain my requirements very well.

The problem with the DEFAULTOP=AND approach is that then a query for
"garden centres bristol" or indeed "bristol garden centres that sell red
plants" will not match the "garden centres" record, for obvious reasons.

In essence there are a number of cases where I'd like to add boolean
keywords to the index for a record that are actually multi-word
keywords, i.e. any of the individual words taken in isolation from the
multi-word sequence is not enough to give a (good) match, but I'd still
like to allow an overall OR type query.

Consider the example of a "wifi hotspot" record, for which I'd
notionally like the 5 keywords:

wifi
wi fi
wi fi hotspot
wi fi hot spot
wifi hot spot

But clearly it would be less than useful for a query for "wi cake
sellers" to match this record, or indeed for a search for "red spot on
chin" to match.

Mark

word pair indexing and querying
user name
2006-09-21 13:29:29
On Thu, Sep 21, 2006 at 02:17:40PM +0100, Mark Hagger wrote:

> The problem with the DEFAULTOP=AND approach is that then a query for
> "garden centres bristol" or indeed "bristol garden centres that sell red
> plants" will not match the "garden centres" record, for obvious reasons.

True.

> In essence there are a number of cases where I'd like to add boolean
> keywords to the index for a record that are actually multi-word
> keywords, ie any of the individual words in isolation of the multi-word
> sequence are not enough to give a (good) match, but still allow an
> overall OR type query.

You could do this by adding terms that were generated from the
multi-word keywords, upweighting them -- wdf>1 in
Document::add_posting() -- and then magically adding them in to your
queries. You'd need to customise both indexing and query generation.
(You don't need boolean there - indeed, boolean will give you precisely
the wrong behaviour, I think.)
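
To make the idea concrete, here is a minimal Python sketch of compound
terms using a toy in-memory index. It is not Xapian code: ToyIndex,
compound_term, the "_" joining convention and the wdf value of 5 are all
invented for illustration. In a real indexer the compound term would be
added to the Xapian document with a raised wdf instead.

```python
import re
from collections import defaultdict


def words(text):
    """Lower-case a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compound_term(phrase_words):
    """Join words into one term, e.g. ['garden', 'centre'] -> 'garden_centre'."""
    return "_".join(phrase_words)


class ToyIndex:
    """Tiny stand-in for a term index: term -> {doc_id: wdf}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_keyword(self, doc_id, keyword, wdf=5):
        # Index only the compound term, with a raised wdf (5 is arbitrary).
        term = compound_term(words(keyword))
        self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + wdf

    def search(self, query, max_len=4):
        # Expand the query into every run of 1..max_len adjacent words and
        # sum the wdf of each compound term that matches.
        q = words(query)
        scores = defaultdict(int)
        for n in range(1, max_len + 1):
            for i in range(len(q) - n + 1):
                term = compound_term(q[i:i + n])
                for doc_id, wdf in self.postings.get(term, {}).items():
                    scores[doc_id] += wdf
        return dict(scores)


index = ToyIndex()
index.add_keyword(1, "garden centre")
for kw in ("wifi", "wi fi", "wi fi hotspot", "wi fi hot spot", "wifi hot spot"):
    index.add_keyword(2, kw)

print(index.search("bristol garden centre"))  # only record 1 matches
print(index.search("centre"))                 # no match
print(index.search("red spot on chin"))       # no match
```

Because only whole keywords are indexed, a lone "centre" or "spot" in
the query has nothing to match, while the query as a whole still behaves
like an OR over all the word runs it contains.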

Have you considered playing with the parameters to the BM25 weighting
scheme? This currently isn't exposed through omega in a configurable
way, although it could be fairly easily.
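
For a feel of what such parameters do, here is a Python sketch of a
common textbook form of the BM25 term weight. It is not Xapian's
implementation: Xapian's BM25Weight takes more parameters than the two
shown and may differ in detail, and the default values and collection
sizes below are arbitrary assumptions.

```python
import math


def bm25_term_weight(wdf, doc_len, avg_len, n_docs, term_freq, k1=1.0, b=0.5):
    """Textbook BM25 weight of one query term in one document.

    wdf       -- occurrences of the term in the document
    doc_len   -- length of the document in terms
    avg_len   -- average document length in the collection
    n_docs    -- number of documents in the collection
    term_freq -- number of documents containing the term
    """
    # Rarer terms get a larger inverse document frequency.
    idf = math.log((n_docs - term_freq + 0.5) / (term_freq + 0.5) + 1.0)
    # b controls how strongly document length affects the weight.
    norm = k1 * ((1.0 - b) + b * doc_len / avg_len)
    return idf * wdf * (k1 + 1.0) / (wdf + norm)


for b in (0.0, 0.5, 1.0):
    short = bm25_term_weight(1, 2, 4, 100, 10, b=b)
    long_ = bm25_term_weight(1, 8, 4, 100, 10, b=b)
    print(f"b={b}: short record {short:.3f}, long record {long_:.3f}")
```

With b=0 the two records score the same. As b rises towards 1 the
longer record is penalised more, which is the kind of knob that matters
when records are only a few words long.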

> Consider the example of a "wifi hotspot" record, I'd notionally like the
> 5 keywords:
> 
> wifi
> wi fi
> wi fi hotspot
> wi fi hot spot
> wifi hot spot
> 
> But clearly it would be less than useful for a query for "wi cake
> sellers" to match this record, nor indeed a search for "red spot on
> chin" to match.

There are two ways of approaching this - the above is one. You can
also do it by not trying to exclude the less helpful matches, and
merely trying to ensure that more relevant matches get a higher rank
(ie: that Xapian actually gives them higher relevance in the match
set).

If you have a very specific query domain, you could identify
significant keywords and upweight them (you need a modified indexer
for this). There may also be something you can do by parametrising
BM25Weight, again (or perhaps a combination of the two).

It's difficult to suggest general approaches to this, of course.
Others may have some more useful comments at this stage.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  jamestartarus.org                                uncertaintydivision.org

word pair indexing and querying
user name
2006-09-21 16:14:29
On Thu, Sep 21, 2006 at 02:17:40PM +0100, Mark Hagger wrote:
> In essence there are a number of cases where I'd like to add boolean
> keywords to the index for a record that are actually multi-word
> keywords, ie any of the individual words in isolation of the multi-word
> sequence are not enough to give a (good) match, but still allow an
> overall OR type query.

I'd suggest just using an OR query with a percentage weight cut-off.
Then anything significantly less good than the best match will get
rejected.
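
As a sketch of what a percentage cut-off does, the following Python
function scales each weight against the best match and drops anything
below a threshold. It is a simplified model: scaling against the top
weight is an assumption made here, and the weights in the example are
invented. In Xapian itself the cut-off would be applied by the matcher
(Enquire::set_cutoff(), as far as I know) rather than afterwards.

```python
def apply_percent_cutoff(matches, cutoff_percent):
    """Keep matches scoring at least cutoff_percent of the best weight.

    matches is a list of (doc_id, weight) pairs in any order.
    """
    if not matches:
        return []
    best = max(weight for _, weight in matches)
    if best <= 0:
        return []
    kept = []
    # Walk the matches best-first and convert each weight to a percentage.
    for doc_id, weight in sorted(matches, key=lambda m: m[1], reverse=True):
        percent = round(100 * weight / best)
        if percent >= cutoff_percent:
            kept.append((doc_id, percent))
    return kept


# Invented weights for the query "garden centre".
results = [("garden centre", 8.2), ("garden shed", 3.9), ("job centre", 3.1)]
print(apply_percent_cutoff(results, 60))
```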

Cheers,
    Olly

word pair indexing and querying
user name
2006-09-21 16:42:38
Except that's not really going to work very well; here's an example
query on one of our development databases:

http://staging.gjm.info/cgi-bin/omega?P=centre&DB=Business52-GB&FMT=xml

This gives a 100% relevance hit against "job centre", so not much scope
for a cut-off there, and for the record in this application I'd need
this query to produce either nothing or at worst a low relevance hit
against "job centre".

(I would point out that this database has very little in it, just under
100 records.)

It is starting to look suspiciously as if Xapian just isn't going to be
the way to go here; in truth even the biggest dataset that I'd be
playing with here won't be more than about 100k records.

Back to pondering.

Mark




word pair indexing and querying
user name
2006-09-21 17:00:48
On Thu, Sep 21, 2006 at 05:42:38PM +0100, Mark Hagger wrote:
> Except thats not really going to work very well, here's an example query
> on one of our development databases:
> 
> http://staging.gjm.info/cgi-bin/omega?P=centre&DB=Business52-GB&FMT=xml
> 
> This gives a 100% relevance hit against "job centre", so not much scope
> for a cut-off there, and for the record in this application I'd need
> this query to produce either nothing or at worst a low relevance hit
> against "job centre".

TBH, for a query as vague as "centre", "job centre" seems a good match
to me.  As a user, I'm not sure what I'd be expecting to get for a query
for "centre"...

Currently if we have a match which includes all the terms in the query
we peg that as 100% and scale other matches proportionally.  If the
highest scoring match doesn't include all the terms in the query, we
make its percentage score depend on the weights of the terms which
do and don't match.

For some weighting schemes, working out percentages of the
"max_possible" weight (Enquire::get_max_possible()) might be a better
approach, but for BM25, max_possible is generally substantially higher
than any weight you get in real situations which is why we use the
scheme above.
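
The percentage scheme just described can be modelled roughly in Python
as follows. This is a toy, not Xapian's code: in particular, taking the
top match's percentage as the share of total query-term weight that it
matched is an assumed formula, and the weights are invented.

```python
def percentages(query_term_weights, matches):
    """Toy percentage scaling.

    query_term_weights -- {term: weight the term contributes when it matches}
    matches            -- list of (doc_id, weight, matched_terms), best first
    """
    if not matches:
        return []
    total = sum(query_term_weights.values())
    _, top_weight, top_terms = matches[0]
    matched = sum(query_term_weights[t] for t in top_terms)
    # 100% if the best hit matched every query term, otherwise the share
    # of query-term weight that it did match (an assumed formula).
    top_percent = 100.0 * matched / total
    # Every other hit is scaled in proportion to the best hit.
    return [(doc_id, round(top_percent * weight / top_weight))
            for doc_id, weight, _ in matches]


hit = [("job centre", 1.6, {"centre"})]
print(percentages({"centre": 2.0}, hit))                 # one-term query
print(percentages({"garden": 3.0, "centre": 2.0}, hit))  # two-term query
```

In this model a one-term query always puts its best hit at 100%, which
is why "centre" alone scores 100% against "job centre".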

> (I would point out that this database has very little in it, just under
> 100 records.)
> 
> It is starting to look suspiciously as if xapian just isn't going to be
> the way to go here, in truth even the biggest dataset that I'd be
> playing with here won't be more than about 100k records.

You seem to have very short documents and a very particular idea of what
constitutes a good match.

With so few records, you can afford to run multiple Xapian queries
or perform post-processing of results which might help.  If you're happy
with the ordering of the hits, you could for example look at the
weights returned and how many terms match, and max_possible and compute
your own relevance percentage and when to stop showing matches.
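
One possible shape for such post-processing, sketched in Python. It
uses only the "how many terms match" part of the suggestion and ignores
the weights and max_possible. The rescore function, the idea of
comparing against the record's name, and the record ids are all
hypothetical.

```python
def rescore(hits, query, min_percent=100):
    """Recompute a relevance percentage for each hit and drop weak ones.

    hits  -- list of (doc_id, record_name) pairs, in search result order
    query -- the user's query string
    """
    query_terms = set(query.lower().split())
    shown = []
    for doc_id, name in hits:
        name_terms = set(name.lower().split())
        if not name_terms:
            continue
        # Share of the record's own words that appear in the query.
        percent = round(100 * len(name_terms & query_terms) / len(name_terms))
        if percent >= min_percent:
            shown.append((doc_id, percent))
    return shown


hits = [(7, "job centre"), (9, "garden centre")]
print(rescore(hits, "centre"))                  # nothing survives
print(rescore(hits, "bristol garden centre"))   # only record 9
print(rescore(hits, "centre", min_percent=50))  # both, at 50%
```

Lowering min_percent turns the hard rejection into a "low relevance
match" instead, and the order returned by the search is preserved.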

Cheers,
    Olly

word pair indexing and querying
user name
2006-09-21 17:11:48
This is interesting, and something which has come up a couple of times
now (though I'm still not entirely sure why people want it - I'll come
to that later).

What Xapian does (for the simple case of a search which is a set of
terms ORred together) is to search through the database for all
documents containing any of the terms in the query, and calculate a
score for each document based on the terms in the query which are
contained in the document.

The document length is taken into account to some extent, but what you
appear to want is for a document containing terms which are _not_ in the
query to be heavily penalised.

This isn't the usual intention with a Xapian search - the idea is that
even if only some parts of the document are relevant, those parts are
worth returning.

I think we _could_ produce the kind of result you want using a custom
weighting object (or, possibly just using the right parameters to the
standard BM25Weight object).  However, the document weights are
normalised after the match process by checking how many of the query
terms occur in the top ranked document: thus a query which contains only
one term will always give a score of 100% to its top ranked document.

In the test case you're working on, if there was a document in the
database which contained _only_ the term "centre", this document would
be returned by the search at 100%, and the "job centre" document would
be returned with a (slightly) lower score.

I have to ask though: what is wrong with a search for "centre" returning
a document about "job centre"?  Would you object to a search for "job"
returning the document?  Do you think your users would really be
confused by a search for "centre" returning a document about "job
centre"?  If you can explain why this is a problem, we'll be more able
to do something about it.

-- 
Richard

word pair indexing and querying
user name
2006-09-21 17:12:03
On Thu, Sep 21, 2006 at 05:42:38PM +0100, Mark Hagger wrote:

> It is starting to look suspiciously as if xapian just isn't going to be
> the way to go here, in truth even the biggest dataset that I'd be
> playing with here won't be more than about 100k records.

How are you proposing to identify (automatically!) the phrases you
consider more important if matched in entirety? I think that's crucial
to finding a solution...

(Hmm ... we don't have a way of giving one half of a Query() branch a
bigger weight than another, do we? I'm thinking of having an operator
which effectively just frobs q_t appropriately in BM25.)
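
Purely to illustrate the idea of weighting one branch of a query more
heavily than another, here is a toy Python scorer. It is not a Xapian
operator: score_or, the per-branch scale factor, the unit term weights
and the "garden_centre" phrase term are all invented for this sketch.

```python
def score_or(branches, doc_terms):
    """Score a document against an OR of weighted branches.

    branches  -- list of (scale, {term: weight}) pairs; scale multiplies
                 every term weight in that branch
    doc_terms -- set of terms present in the document
    """
    total = 0.0
    for scale, term_weights in branches:
        for term, weight in term_weights.items():
            if term in doc_terms:
                total += scale * weight
    return total


query = [
    (1.0, {"garden": 1.0, "centre": 1.0}),  # ordinary single-word terms
    (5.0, {"garden_centre": 1.0}),          # hypothetical phrase term, boosted
]
print(score_or(query, {"garden", "centre", "garden_centre"}))  # 7.0
print(score_or(query, {"job", "centre"}))                      # 1.0
```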

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  jamestartarus.org                                uncertaintydivision.org

word pair indexing and querying
user name
2006-09-21 17:14:35
On Thu, Sep 21, 2006 at 06:12:03PM +0100, James Aylett wrote:
> (Hmm ... we don't have a way of giving one half of a Query() branch a
> bigger weight than another, do we? I'm thinking of having an operator
> which effectively just frobs q_t appropriately in BM25.)

We don't currently, but such a thing would be very easy to implement.
The hardest part would be making a reasonable interface for it in the
API.

-- 
Richard
