Thread: Initial patch for ExternalPostList

2006-06-03 00:24:12
Hi Everybody,
Here is the first version of my patch for an ExternalPostList. It should apply cleanly to 0.9.5 and 0.9.6.
You can use it by first implementing an ExternalPostingSource, then creating a new Query object, passing a reference to an instance of your implementation to the constructor (see query.h). The ExternalPostingSource implementation is reference counted, so when it's no longer needed it can clean itself up.
It works well for filtering results, but it has a problem: once it filters, all of the results are returned with 100% relevancy. I'm not quite sure how to fix that, but I think it's because I just used the default weight implementations in matcher/externalsourcepostlist.h. Any hints or patches to fix that so it works correctly would be greatly appreciated.
Also, the use of an ExternalPostList query breaks the network method of transmitting queries. I haven't written code to serialize the post list, or even decided whether that's worth doing.
I realize there has been some stylistic license taken with the formatting of this patch. I didn't pay much attention to following the existing indentation and bracket rules and whatnot. If you're up for including this in the release branch, I'd be glad to take the time to clean it up further.
Thanks for all of your help,
Rusty
--
Rusty Conover
InfoGears Inc.

2006-06-03 02:00:37
On Fri, Jun 02, 2006 at 06:24:12PM -0600, Rusty Conover wrote:
> You can use it by first implementing an ExternalPostingSource, then creating a new Query object passing a reference to an instance of your implementation to the constructor, see query.h. The ExternalPostingSource implementation is reference counted, so when it's no longer needed it can clean itself up.
If you're going to reference count it, it would be better to use the existing reference counting machinery (include/xapian/base.h) unless there's a good reason not to.
> It works well for filtering results but it has a problem that once it filters all of the results are returned with 100% relevancy. I'm not quite sure how to fix that but I think it's because I just used the default weight implementations in matcher/externalsourcepostlist.h. Any hints or patches to fix that so it works correctly would be greatly appreciated.
I think the weights look OK, but get_termfreq_*() shouldn't return 0. They should instead return bounds/estimates of the number of entries that the ExternalPostingSource would return if you called next() until at_end() returned true.
I suspect returning 0 might confuse the matcher.
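To make the get_termfreq_*() point concrete, here is a minimal, self-contained sketch of a posting source that reports a non-zero count. It is not the class from the patch: VectorPostingSource, the DocId and DocCount typedefs, and the exact method signatures are assumptions for illustration, modelled only on the next()/at_end()/get_docid() calls mentioned above. Because the whole list sits in memory, the lower bound, the estimate and the upper bound are all simply its size; a source backed by an SQL query might only be able to offer a rough estimate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using DocId = std::uint32_t;     // stand-in for Xapian's docid type (assumption)
using DocCount = std::uint32_t;  // stand-in for Xapian's doccount type (assumption)

// Hypothetical in-memory posting source; not the class from the patch.
class VectorPostingSource {
    std::vector<DocId> docids_;  // sorted ascending, no duplicates
    std::size_t pos_ = 0;
    bool started_ = false;

  public:
    explicit VectorPostingSource(std::vector<DocId> docids)
        : docids_(std::move(docids)) {
        std::sort(docids_.begin(), docids_.end());
        docids_.erase(std::unique(docids_.begin(), docids_.end()),
                      docids_.end());
    }

    // The exact number of entries is known here, so all three agree.
    DocCount get_termfreq_min() const { return static_cast<DocCount>(docids_.size()); }
    DocCount get_termfreq_est() const { return static_cast<DocCount>(docids_.size()); }
    DocCount get_termfreq_max() const { return static_cast<DocCount>(docids_.size()); }

    // The first call moves onto the first entry; later calls advance by one.
    void next() {
        if (!started_) {
            started_ = true;
        } else if (pos_ < docids_.size()) {
            ++pos_;
        }
    }

    bool at_end() const { return started_ && pos_ >= docids_.size(); }

    DocId get_docid() const {
        assert(started_ && pos_ < docids_.size());
        return docids_[pos_];
    }
};
```

Iterating with next() until at_end() visits exactly get_termfreq_max() entries, which is the property the bounds are meant to describe.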
> Also the use of an ExternalPostList query breaks the network method of transmitting queries, I haven't written code to serialize the post list or even decide if that's worth doing.
I think this is probably a concept which doesn't fit well with the remote backend. You could perhaps use a registering scheme similar to that provided for user-defined weighting schemes, but a weighting scheme is just an algorithm plus some parameters, whereas this class will typically be talking to some other system like an SQL database.
> I realize there has been some stylistic license taken with the formatting of this patch. I didn't pay much attention to following existing indentation and bracket rules and what not. If you're up for including this into the release branch, I'd be glad to take the time and clean it up even further.
I think it's probably suitable for inclusion with a bit more work. Similar ideas have been discussed before, and I can see scope for lots of neat things this enables.
I've had a quick read through - here are some comments:
* I don't think you need to call start_construction(), etc and have the set_external_source method - just have a Query::Internal::Internal(ExternalPostingSource *) ctor for the internal class.
* I think OP_EXTERNAL_POST_LIST should be internal like OP_LEAF.
* There's no point trying to encode OP_EXTERNAL_POST_LIST in the query serialisation if this isn't supported by the remote backend (especially as I've just rewritten query serialisation but not checked it in yet!) Better to just throw UnimplementedError if we get there.
* ExternalPostingSource should derive from RefCntBase and then you can use the same reference counting mechanism the other API classes do.
* Don't renumber OP_ELITE_SET - the java bindings currently have these values hard-wired (ick) so they'll break! That's why it's explicitly numbered currently (the enum with value 9 is no longer used).
* In this case, skip_to and next should always return NULL. You only need to return something if you want to replace the current postlist with a different one (e.g. an OR turning into an AND because of optimisations based on the minimum weight required) and we don't want to do that here.
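The last comment describes a calling convention that is easy to miss, so here is a simplified model of it. This is only a sketch under stated assumptions: the PostList interface below is a cut-down stand-in (the minimum-weight argument mentioned above is omitted), and ExternalSourcePostList and advance_postlist() are hypothetical names. It shows next() and skip_to() returning NULL to mean "keep using this postlist", and the caller swapping in a replacement only when a non-NULL pointer comes back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

using DocId = std::uint32_t;  // stand-in for Xapian's docid type (assumption)

// Cut-down stand-in for the matcher's postlist interface (assumption).
struct PostList {
    virtual ~PostList() = default;
    // Return nullptr to stay in place, or a new PostList that should
    // replace this one in the caller's tree.
    virtual PostList* next() = 0;
    virtual PostList* skip_to(DocId did) = 0;
    virtual bool at_end() const = 0;
    virtual DocId get_docid() const = 0;
};

// Hypothetical postlist over an external, already sorted list of docids.
class ExternalSourcePostList : public PostList {
    std::vector<DocId> docids_;  // assumed sorted ascending
    std::size_t pos_ = 0;
    bool started_ = false;

  public:
    explicit ExternalSourcePostList(std::vector<DocId> sorted_docids)
        : docids_(std::move(sorted_docids)) {}

    PostList* next() override {
        if (!started_) {
            started_ = true;
        } else if (pos_ < docids_.size()) {
            ++pos_;
        }
        return nullptr;  // never replaces itself
    }

    PostList* skip_to(DocId did) override {
        started_ = true;
        while (pos_ < docids_.size() && docids_[pos_] < did) ++pos_;
        return nullptr;  // never replaces itself
    }

    bool at_end() const override { return started_ && pos_ >= docids_.size(); }

    DocId get_docid() const override {
        assert(started_ && pos_ < docids_.size());
        return docids_[pos_];
    }
};

// How a caller might honour the convention: swap only on a non-null return.
inline void advance_postlist(std::unique_ptr<PostList>& pl) {
    if (PostList* replacement = pl->next()) {
        pl.reset(replacement);
    }
}
```

A postlist that wanted to restructure itself (the OR turning into an AND case) would return the new object instead of nullptr; a plain external filter has no reason to.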
Phew! That might seem like a lot, but they're all pretty minor really. Generally this is looking pretty plausible.
Cheers,
Olly
_______________________________________________
Xapian-devel mailing list
Xapian-devel lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel

2006-06-03 02:12:22
Um, I don't know quite how I managed to attach your patch to my reply, but just ignore it - it's identical to the copy you sent to the list!
Cheers,
Olly

2006-06-03 04:34:33
Pardon my ignorance, but can you explain a little of what one can do with your addition here? I mean, what kind of functionality does it add to Xapian?
I have read your posts about it, but don't feel like I've wrapped my head around it entirely.
Alec

2006-06-03 06:19:39
On Jun 2, 2006, at 10:34 PM, Alexander Lind wrote:
> Pardon my ignorance, but can you explain a little of what one can do with your addition here? I mean what kind of functionality does it add to Xapian?
> I have read your posts about it, but don't feel like I've wrapped my head around it entirely
The patch allows you to provide a source of Xapian doc ids from an external source (in my case, a SQL database) and make it part of a query, so that the documents returned are required to be members of that source. It allows you to search a subset of a Xapian database pretty easily.
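As a rough illustration of the effect (not of how the patch does it internally), the filter behaves like intersecting the docids a query would match with the docids supplied by the external source. The sketch below assumes both lists are sorted in ascending docid order; restrict_to_source() and the DocId typedef are hypothetical names used only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

using DocId = std::uint32_t;  // stand-in for Xapian's docid type (assumption)

// Keep only those matches that also appear in the external source.
// Both inputs must be sorted ascending, as postlists are.
std::vector<DocId> restrict_to_source(const std::vector<DocId>& text_matches,
                                      const std::vector<DocId>& allowed) {
    assert(std::is_sorted(text_matches.begin(), text_matches.end()));
    assert(std::is_sorted(allowed.begin(), allowed.end()));
    std::vector<DocId> result;
    std::set_intersection(text_matches.begin(), text_matches.end(),
                          allowed.begin(), allowed.end(),
                          std::back_inserter(result));
    return result;
}
```

In the matcher this would happen lazily, one docid at a time, rather than by materialising whole vectors; the vectors here just keep the example short.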
If you still have questions I'd be glad to answer them.
Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com

2006-06-03 16:09:12
On Sat, Jun 03, 2006 at 12:19:39AM -0600, Rusty Conover wrote:
> On Jun 2, 2006, at 10:34 PM, Alexander Lind wrote:
> > Pardon my ignorance, but can you explain a little of what one can do with your addition here? I mean what kind of functionality does it add to Xapian?
> The patch allows you to provide a source of xapian doc ids from an external source, in my case I use a SQL database, and make it part of a query so that the documents returned will be required to be a member of that source. It allows you to search a subset of a xapian database pretty easily.
And this is an idea that's been around for a while, for example:
http://thread.gmane.org/gmane.comp.search.xapian.general/230/focus=232
I'm sure there are many uses. Here are a few which come to mind to give you an idea:
* Restricting search results to those the user has permission to see. This can already be done by creating filter terms in Xapian, but updating the permissions in Xapian to reflect changes in the underlying system can be tricky. If there's no hook to tell you permissions have changed, all you can really do is perform a full sweep to check periodically, and people generally prefer permission changes to be reflected right away.
* A usenet server could allow a search to be restricted to only those articles a user has already read (by filtering based on article numbers from the .newsrc) - this information is very dynamic and different for each user, so it's hard to achieve this currently.
* Sometimes sites want to be able to quickly remove pages, including from the search (for legal reasons perhaps). This class would allow entries to be instantly made invisible to searches without complicating the standard update process. At a convenient point, you can drop the entries from the database and remove them from the external filter.
* If your documents are added in date order, you can achieve "sort by date" very cheaply by using BoolWeight and Enquire::set_docid_order. With a simple external map from date to docid, you could use this class to implement a similarly cheap "filter by date" too.
* If the external source can set weight information, this new class will actually provide a full implementation of the MatchBiasFunctor idea which currently is only present as a proof-of-concept. This would allow you to add an extra weight to each document - for example in a news search you might want to give a small weight boost to newer articles, or you could add a weight contribution based on link analysis, click-through rates, etc.
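The "filter by date" item in the list above can be sketched as follows. This is only an illustration under the stated assumption that documents are added in date order, so docids increase with date; DateIndex, Day and DocId are hypothetical names, and the sketch stops at computing a docid range (turning that range into a posting source is left out).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

using DocId = std::uint32_t;  // stand-in for Xapian's docid type (assumption)
using Day = int;              // e.g. days since some epoch (assumption)

// External map from date to the first docid indexed on that date.
// Only meaningful if docids are assigned in increasing date order.
class DateIndex {
    std::map<Day, DocId> first_docid_;
    DocId next_unused_ = 1;  // one past the highest docid recorded

  public:
    void record(Day day, DocId docid) {
        first_docid_.emplace(day, docid);  // keeps the first docid per day
        if (docid >= next_unused_) next_unused_ = docid + 1;
    }

    // Half-open docid range [begin, end) covering the days from..to inclusive.
    std::pair<DocId, DocId> range(Day from, Day to) const {
        assert(from <= to);
        auto lo = first_docid_.lower_bound(from);
        auto hi = first_docid_.upper_bound(to);
        DocId begin = (lo == first_docid_.end()) ? next_unused_ : lo->second;
        DocId end = (hi == first_docid_.end()) ? next_unused_ : hi->second;
        return {begin, end};
    }
};
```

A date filter then only needs two map lookups per query, after which the external source can hand the matcher every docid in the resulting range.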
Cheers,
Olly

2006-06-03 16:56:33
On Sat, Jun 03, 2006 at 05:09:12PM +0100, Olly Betts wrote:
> * Restricting search results to those the user has permission to see.
> * Sometimes sites want to be able to quickly remove pages
A common combination of the two is the idea of new pages being created first on a staging site, and only later put live. If you're running a full revisioned system it gets more complex than this, but I imagine this kind of thing will help considerably if anyone, eg, wanted to make Xapian available as a search system for something like Drupal.
J
--
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james tartarus.org                               uncertaintydivision.org

2006-06-03 17:12:27
On Sat, Jun 03, 2006 at 05:56:33PM +0100, James Aylett wrote:
> On Sat, Jun 03, 2006 at 05:09:12PM +0100, Olly Betts wrote:
> > * Restricting search results to those the user has permission to see.
> > * Sometimes sites want to be able to quickly remove pages
> A common combination of the two is the idea of new pages being created first on a staging site, and only later put live. If you're running a full revisioned system it gets more complex than this, but I imagine this kind of thing will help considerably if anyone, eg, wanted to make Xapian available as a search system for something like Drupal.
You mean index a page once it's in the system, but not allow it to actually appear in the results until it goes live, thus allowing apparently instant indexing of new pages but with the speed benefits of adding documents in batches?
Cheers,
Olly

2006-06-03 19:38:49
> * If the external source can set weight information, this new class will actually provide a full implementation of the MatchBiasFunctor idea which currently is only present as a proof-of-concept. This would allow you to add an extra weight to each document - for example in a news search you might want to give a small weight boost to newer articles, or you could add a weight contribution based on link analysis, click-through rates, etc.
Hi Olly,
I'm interested in hearing more about how the actual weight would be returned for each document.
Thanks,
Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com

2006-06-03 20:01:11
On Sat, Jun 03, 2006 at 01:38:49PM -0600, Rusty Conover wrote:
> I'm interested in hearing more about how the actual weight would be returned for each document.
ExternalPostingSource would have the 2 methods get_weight() and get_maxweight(). In response to a call to get_weight() it would return the weight to contribute to the current document (i.e. the one get_docid() would currently return).
And a call to get_maxweight() would return an upper bound on what get_weight() can return for any docid.
So to bias towards newer articles, you might have something like this (which is pretty much what the current match bias stuff is hardwired to do):
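In the ctor:
    W = some positive weight
    K = some negative constant which affects the decay rate
    now = time(NULL);
And the two methods:
    // Upper bound on anything get_weight() can return.
    Xapian::weight get_maxweight() const {
        return W;
    }
    // Weight for the current document, decaying exponentially with age.
    // get_timestamp() stands for your own docid-to-date lookup.
    Xapian::weight get_weight() const {
        time_t t = get_timestamp(get_docid());
        if (t >= now) {
            // Dated now or in the future: give the full weight.
            return W;
        }
        t = now - t;  // age in seconds
        return W * exp(K * t);
    }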
If you only want to use the external source for filtering, both of these can just return 0.
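One possible way to choose K for the decay above is to start from a half-life: if a document should contribute W/2 once it is h seconds old, then W * exp(K * h) = W / 2 gives K = -ln(2) / h. The sketch below is only an illustration of that arithmetic; decay_constant_for_half_life() and recency_weight() are hypothetical helpers, not part of Xapian or of the patch, and plain double stands in for Xapian::weight.

```cpp
#include <cassert>
#include <cmath>
#include <ctime>

// Decay constant K (negative) such that the weight halves every
// half_life_seconds: W * exp(K * h) == W / 2  =>  K = -ln(2) / h.
inline double decay_constant_for_half_life(double half_life_seconds) {
    assert(half_life_seconds > 0.0);
    return -std::log(2.0) / half_life_seconds;
}

// Same shape as the get_weight() sketch above, written as a free function
// so it can be tried out in isolation.
inline double recency_weight(double W, double K,
                             std::time_t doc_time, std::time_t now) {
    if (doc_time >= now) {
        return W;  // dated now or in the future: full weight
    }
    double age = std::difftime(now, doc_time);  // seconds since doc_time
    return W * std::exp(K * age);
}
```

With W as the value returned by get_maxweight(), the result never exceeds that bound as long as K is negative, which is what the matcher relies on.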
Cheers,
Olly