Thread: Initial patch for ExternalPostList




Initial patch for ExternalPostList
user name
2006-06-03 00:24:12
Hi Everybody,

Here is the first version of my patch for an ExternalPostList; it should apply cleanly to 0.9.5 and 0.9.6.

You can use it by first implementing an ExternalPostingSource, then creating a new Query object, passing a reference to an instance of your implementation to the constructor; see query.h. The ExternalPostingSource implementation is reference counted, so when it's no longer needed it can clean itself up.

It works well for filtering results, but it has a problem: once it filters, all of the results are returned with 100% relevancy. I'm not quite sure how to fix that, but I think it's because I just used the default weight implementations in matcher/externalsourcepostlist.h.  Any hints or patches to fix that so it works correctly would be greatly appreciated.

Also, the use of an ExternalPostList query breaks the network method of transmitting queries. I haven't written code to serialize the post list, or even decided whether that's worth doing.

I realize there has been some stylistic license taken with the formatting of this patch. I didn't pay much attention to following existing indentation and bracket rules and what not.  If you're up for including this into the release branch, I'd be glad to take the time and clean it up even further.

Thanks for all of your help,

Rusty
--
Rusty Conover
InfoGears Inc.

Initial patch for ExternalPostList
user name
2006-06-03 02:00:37
On Fri, Jun 02, 2006 at 06:24:12PM -0600, Rusty Conover wrote:
> You can use it by first implementing an ExternalPostingSource, then
> creating a new Query object passing a reference an instance of your
> implementation to the constructor, see query.h. The
> ExternalPostingSource implementation is reference counted, so when
> its no longer needed it can clean itself up.

If you're going to reference count it, it would be better to use the
existing reference counting machinery (include/xapian/base.h) unless
there's a good reason not to.

> It works well for filtering results but it has a problem that once it
> filters all of the results are returned with 100% relevancy. I'm not
> quite sure how to fix that but I think its because I just used the
> default weight implementations in matcher/externalsourcepostlist.h.
> Any hints or patches to fix that so it works correctly would be
> greatly appreciated.

I think the weights look OK, but get_termfreq_*() shouldn't return
0 but rather bounds/estimates of the number of entries that the
ExternalPostingSource would return if you called next() until
at_end() returned true.

I suspect returning 0 might confuse the matcher.
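
A minimal sketch of that point, not code from the patch: `VectorPostingSource`, `count_entries` and the `docid` typedef are hypothetical names, and the `_min`/`_est`/`_max` suffixes are assumed for the get_termfreq_*() family. The source here holds its docids in memory, so the exact count is known and all three figures can agree; a source backed by a database would more likely return a lower bound, an estimate and an upper bound.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef unsigned docid;  // stand-in for Xapian::docid in this sketch

// Hypothetical external source backed by a sorted in-memory list of docids.
class VectorPostingSource {
    std::vector<docid> ids;  // sorted ascending, no duplicates
    std::size_t pos;         // 0 = not started, otherwise current index + 1
  public:
    explicit VectorPostingSource(std::vector<docid> sorted_ids)
        : ids(std::move(sorted_ids)), pos(0) {}

    // The exact number of entries is known, so the bounds and the
    // estimate are all the same non-zero figure.
    std::size_t get_termfreq_min() const { return ids.size(); }
    std::size_t get_termfreq_est() const { return ids.size(); }
    std::size_t get_termfreq_max() const { return ids.size(); }

    void next() {
        assert(pos <= ids.size());  // must not advance past the end
        ++pos;
    }
    bool at_end() const { return pos > ids.size(); }
    docid get_docid() const {
        assert(pos >= 1 && pos <= ids.size());
        return ids[pos - 1];
    }
};

// Counts entries by calling next() until at_end() is true, which is the
// quantity the get_termfreq_*() values are meant to bound.
std::size_t count_entries(VectorPostingSource src) {
    std::size_t n = 0;
    for (src.next(); !src.at_end(); src.next()) ++n;
    return n;
}
```

For any such source, `count_entries(src)` should fall between `get_termfreq_min()` and `get_termfreq_max()`.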

> Also the use of an ExternalPostList query breaks the network method
> of transmitting queries, I haven't written code to serialize the post
> list or even decide if that's worth doing.

I think this is probably a concept which doesn't fit well with the
remote backend.  You could perhaps use a registering scheme similar
to that provided for user-defined weighting schemes, but a weighting
scheme is just an algorithm plus some parameters, whereas this class
will typically be talking to some other system like an SQL database.

> I realize there has been some stylistic license taken with the
> formatting of this patch. I didn't pay much attention to following
> existing indentation and bracket rules and what not.  If you're up
> for including this into the release branch, I'd be glad to take the
> time and clean it up even further.

I think it's probably suitable for inclusion with a bit more work.
Similar ideas have been discussed before, and I can see scope for
lots of neat things this enables.

I've had a quick read through - here are some comments:

* I don't think you need to call start_construction(), etc and
  have the set_external_source method - just have a
  Query::Internal::Internal(ExternalPostingSource *) ctor
  for the internal class.

* I think OP_EXTERNAL_POST_LIST should be internal like OP_LEAF.

* There's no point trying to encode OP_EXTERNAL_POST_LIST in
  the query serialisation if this isn't supported by the remote backend
  (especially as I've just rewritten query serialisation but not checked
  it in yet!)  Better to just throw UnimplementedError if we get there.

* ExternalPostingSource should derive from RefCntBase and then you can
  use the same reference counting mechanism the other API classes do.

* Don't renumber OP_ELITE_SET - the java bindings currently have these
  values hard-wired (ick) so they'll break!  That's why it's explicitly
  numbered currently (the enum with value 9 is no longer used).

* In this case, skip_to and next should always return NULL.  You only
  need to return something if you want to replace the current postlist
  with a different one (e.g. an OR turning into an AND because of
  optimisations based on the minimum weight required) and we don't want
  to do that here.
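
To make the last point concrete, here is a rough sketch under stated assumptions. The `PostList` struct is a simplified stand-in for the matcher's internal interface rather than the real class, and `ExternalListPostList` is a hypothetical name. What it shows is that next() and skip_to() only move the position and always return a null pointer, meaning "keep using this postlist".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef unsigned docid;  // stand-in for Xapian::docid in this sketch

// Simplified stand-in for the matcher's internal postlist interface.
struct PostList {
    virtual ~PostList() {}
    // Return a null pointer to keep this object, or a replacement postlist.
    virtual PostList* next(double w_min) = 0;
    virtual PostList* skip_to(docid did, double w_min) = 0;
    virtual bool at_end() const = 0;
    virtual docid get_docid() const = 0;
};

// Hypothetical postlist over a sorted list of externally supplied docids.
class ExternalListPostList : public PostList {
    std::vector<docid> ids;  // sorted ascending
    std::size_t pos;         // index of the current entry, never > ids.size()
    bool started;
  public:
    explicit ExternalListPostList(std::vector<docid> sorted_ids)
        : ids(std::move(sorted_ids)), pos(0), started(false) {}

    PostList* next(double /*w_min*/) override {
        if (!started) started = true;
        else if (pos < ids.size()) ++pos;
        return nullptr;  // i.e. NULL: never replaces itself
    }

    PostList* skip_to(docid did, double /*w_min*/) override {
        started = true;
        // Move forward to the first entry >= did; never move backwards.
        pos = static_cast<std::size_t>(
            std::lower_bound(ids.begin() + pos, ids.end(), did) - ids.begin());
        return nullptr;  // i.e. NULL: never replaces itself
    }

    bool at_end() const override { return started && pos >= ids.size(); }

    docid get_docid() const override {
        assert(started && pos < ids.size());
        return ids[pos];
    }
};
```

The minimum-weight argument is ignored because this postlist has no cheaper form to switch to, which is why there is never anything to return.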

Phew!  That might seem like a lot, but they're all pretty minor really.
Generally this is looking pretty plausible.

Cheers,
    Olly
_______________________________________________
Xapian-devel mailing list
Xapian-devellists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
Initial patch for ExternalPostList
user name
2006-06-03 02:12:22
Um, I don't know quite how I managed to attach your patch to my reply,
but just ignore it - it's identical to the copy you sent to the list!

Cheers,
    Olly

Initial patch for ExternalPostList
user name
2006-06-03 04:34:33
Pardon my ignorance, but can you explain a little of what one can do with your addition here? I mean, what kind of functionality does it add to Xapian?
I have read your posts about it, but don't feel like I've wrapped my head around it entirely.

Alec

Initial patch for ExternalPostList
user name
2006-06-03 06:19:39
On Jun 2, 2006, at 10:34 PM, Alexander Lind wrote:

> Pardon my ignorance, but can you explain a little of what one can
> do with your addition here?  I mean what kind of functionality does
> it add to Xapian?
> I have read your posts about it, but don't feel like I've wrapped
> my head around it entirely

The patch lets you supply Xapian doc ids from an external source
(in my case, a SQL database) and make that source part of a query,
so that the documents returned are required to be members of that
source.  It allows you to search a subset of a Xapian database
pretty easily.
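
As a rough illustration of the effect only, not of how the patch works inside the matcher: the function below keeps the matching doc ids that also appear in the external list. `restrict_to_source` and the `docid` typedef are made-up names for this sketch, and both inputs are assumed to be sorted in ascending order.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

typedef unsigned docid;  // stand-in for Xapian::docid in this sketch

// Keeps only the matching docids that are also members of the external
// source. Both inputs must be sorted ascending.
std::vector<docid> restrict_to_source(const std::vector<docid>& matches,
                                      const std::vector<docid>& allowed) {
    assert(std::is_sorted(matches.begin(), matches.end()));
    assert(std::is_sorted(allowed.begin(), allowed.end()));
    std::vector<docid> out;
    std::set_intersection(matches.begin(), matches.end(),
                          allowed.begin(), allowed.end(),
                          std::back_inserter(out));
    return out;
}
```

In the patch itself the restriction happens while the match runs, by treating the external source as one more posting list, so no intermediate list like `matches` needs to be built.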

If you still have questions I'd be glad to answer them.

Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com




Initial patch for ExternalPostList
user name
2006-06-03 16:09:12
On Sat, Jun 03, 2006 at 12:19:39AM -0600, Rusty Conover wrote:
> On Jun 2, 2006, at 10:34 PM, Alexander Lind wrote:
> >Pardon my ignorance, but can you explain a little of what one can
> >do with your addition here?  I mean what kind of functionality does
> >it add to Xapian?
>
> The patch allows you to provide a source of xapian doc ids from an
> external source, in my case I use a SQL database, and make it part of
> a query's so that the documents returned will be required to be a
> member of that source.  It allows you to search a subset of a xapian
> database pretty easily.

And this is an idea that's been around for a while, for example:

http://thread.gmane.org/gmane.comp.search.xapian.general/230/focus=232

I'm sure there are many uses.  Here are a few which come to mind to
give you an idea:

* Restricting search results to those the user has permission to see.
  This can already be done by creating filter terms in Xapian, but
  updating the permissions in Xapian to reflect changes in the
  underlying system can be tricky.  If there's no hook to tell
  you permissions have changed, all you can really do is perform a full
  sweep to check periodically, and people generally prefer permission
  changes to be reflected right away.

* A usenet server could allow a search to be restricted to only those
  articles a user has already read (by filtering based on article
  numbers from the .newsrc) - this information is very dynamic and
  different for each user, so it's hard to achieve this currently.

* Sometimes sites want to be able to quickly remove pages, including
  from the search (for legal reasons perhaps).  This class would allow
  entries to be instantly made invisible to searches without
  complicating the standard update process.  At a convenient point, you
  can drop the entries from the database and remove them from the
  external filter.

* If your documents are added in date order, you can achieve "sort by
  date" very cheaply by using BoolWeight and Enquire::set_docid_order.
  With a simple external map from date to docid, you could use this
  class to implement a similarly cheap "filter by date" too.

* If the external source can set weight information, this new class
  will actually provide a full implementation of the MatchBiasFunctor
  idea which currently is only present as a proof-of-concept.  This
  would allow you to add an extra weight to each document - for example
  in a news search you might want to give a small weight boost to newer
  articles, or you could add a weight contribution based on link
  analysis, click-through rates, etc.
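
The "filter by date" idea can be sketched roughly as follows, assuming documents really were added in date order so that docids increase with time. `DateIndex` and `docid_range` are hypothetical names, and the map layout (timestamp to first docid added at that time) is just one possible choice for the external map.

```cpp
#include <cassert>
#include <ctime>
#include <map>
#include <utility>

typedef unsigned docid;  // stand-in for Xapian::docid in this sketch

// Hypothetical external map: timestamp -> first docid added at that time.
typedef std::map<std::time_t, docid> DateIndex;

// Returns the half-open docid range [first, last) covering documents
// dated in [from, to). end_docid is one past the highest docid in use.
std::pair<docid, docid> docid_range(const DateIndex& index,
                                    std::time_t from, std::time_t to,
                                    docid end_docid) {
    assert(from <= to);
    // First recorded timestamp at or after each end of the date range.
    DateIndex::const_iterator lo = index.lower_bound(from);
    DateIndex::const_iterator hi = index.lower_bound(to);
    docid first = (lo == index.end()) ? end_docid : lo->second;
    docid last = (hi == index.end()) ? end_docid : hi->second;
    return std::make_pair(first, last);
}
```

An external source could then walk the docids from `first` up to, but not including, `last`, which costs two map lookups however many documents fall in the range.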

Cheers,
    Olly

Initial patch for ExternalPostList
user name
2006-06-03 16:56:33
On Sat, Jun 03, 2006 at 05:09:12PM +0100, Olly Betts wrote:

> * Restricting search results to those the user has permission to see.
> * Sometimes sites want to be able to quickly remove pages

A common combination of the two is the idea of new pages being created
first on a staging site, and only later put live. If you're running a
full revisioned system it gets more complex than this, but I imagine
this kind of thing will help considerably if anyone, eg, wanted to
make Xapian available as a search system for something like Drupal.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  jamestartarus.org                                uncertaintydivision.org

Initial patch for ExternalPostList
user name
2006-06-03 17:12:27
On Sat, Jun 03, 2006 at 05:56:33PM +0100, James Aylett wrote:
> On Sat, Jun 03, 2006 at 05:09:12PM +0100, Olly Betts wrote:
>
> > * Restricting search results to those the user has permission to see.
> > * Sometimes sites want to be able to quickly remove pages
>
> A common combination of the two is the idea of new pages being created
> first on a staging site, and only later put live. If you're running a
> full revisioned system it gets more complex than this, but I imagine
> this kind of thing will help considerably if anyone, eg, wanted to
> make Xapian available as a search system for something like Drupal.

You mean index a page once it's in the system, but not allow it to
actually appear in the results until it goes live, thus allowing
apparently instant indexing of new pages but with the speed benefits
of adding documents in batches?

Cheers,
    Olly

Initial patch for ExternalPostList
user name
2006-06-03 19:38:49

> * If the external source can set weight information, this new class
>   will actually provide a full implementation of the MatchBiasFunctor
>   idea which currently is only present as a proof-of-concept.  This
>   would allow you to add an extra weight to each document - for example
>   in a news search you might want to give a small weight boost to newer
>   articles, or you could add a weight contribution based on link
>   analysis, click-through rates, etc.

Hi Olly,

I'm interested in hearing more about how the actual weight would be
returned for each document.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com




Initial patch for ExternalPostList
user name
2006-06-03 20:01:11
On Sat, Jun 03, 2006 at 01:38:49PM -0600, Rusty Conover wrote:
> I'm interested in hearing more about how the actual weight would be
> returned for each document.

ExternalPostingSource would have the 2 methods get_weight() and
get_maxweight().  In response to a call to get_weight() it would
return the weight to contribute to the current document (i.e.
what get_docid() would currently return).

And a call to get_maxweight() would return an upper bound on what
get_weight() can return for any docid.

So to bias towards newer articles, you might have something like
this (which is pretty much what the current match bias stuff is
hardwired to do):

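    In the ctor:

        W = some positive weight
        K = some negative constant which affects the decay rate
        now = time(NULL);

    // Upper bound on get_weight(): the full boost W.
    Xapian::weight get_maxweight() const {
        return W;
    }

    // Boost for the current document, decaying with its age.
    Xapian::weight get_weight() const {
        time_t t = get_timestamp(get_docid());
        if (t >= now) {
            // Documents dated now or in the future get the full boost.
            return W;
        }
        // Age in seconds; K is negative, so exp(K * t) shrinks as t grows.
        t = now - t;
        return W * exp(K * t);
    }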

If you only want to use the external source for filtering, both of these
can just return 0.
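
For anyone who wants to try the decay on its own, here is a self-contained sketch of the same formula outside any Xapian class. `recency_weight` and `decay_constant_for_half_life` are hypothetical helpers; the half-life parameterisation is an addition for picking K and is not part of the description above.

```cpp
#include <cassert>
#include <cmath>
#include <ctime>

// Same formula as the get_weight() outline: w is the maximum boost (W),
// k is the negative decay constant (K) per second of document age.
double recency_weight(double w, double k, std::time_t now,
                      std::time_t doc_time) {
    assert(w >= 0.0 && k <= 0.0);
    if (doc_time >= now) {
        return w;  // dated now or in the future: full boost
    }
    double age = std::difftime(now, doc_time);  // age in seconds
    return w * std::exp(k * age);
}

// One way to choose K: the boost halves every half_life_seconds.
double decay_constant_for_half_life(double half_life_seconds) {
    assert(half_life_seconds > 0.0);
    return -std::log(2.0) / half_life_seconds;
}
```

With W = 2 and a one-day half-life, a day-old document gets a boost of about 1, and no document ever gets more than W, which is what get_maxweight() promises.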

Cheers,
    Olly
