
Thread: Document clustering module?




Document clustering module?
2007-09-16 06:27:34
Hi,

I am implementing some document clustering algorithms in the xapian
core. I would like to know if this kind of module will be considered
to be incorporated into the core release. Or is there already some
document clustering module that is just not open-sourced yet?

Best,
Yung-chung Lin

_______________________________________________
Xapian-devel mailing list
Xapian-devellists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel

Re: Document clustering module?
2007-09-16 06:53:17
On Sun, Sep 16, 2007 at 07:27:34PM +0800, Yung-chung Lin wrote:
> I am implementing some document clustering algorithms in the xapian
> core. I would like to know if this kind of module will be considered
> to be incorporated into the core release.

Yes - I think it fits with xapian-core's role, so the issues are things
like scalability, maintainability, API consistency, etc.  The "HACKING"
document in xapian-core has some tips for contributors.

> Or is there already some document clustering module that is just not
> open-sourced yet?

Not that I'm aware of.

Cheers,
    Olly


Re: Document clustering module?
2007-09-16 07:26:05
Hi,

The attached file is my current public clustering interface. I think
it would be easier to have discussions with a header file present.
My clustering module is intended to cluster documents in MSets and it
can enhance query expansion, and clustering is totally done in memory.
I am not sure if clustering on documents in database is necessary,
since it really involves a huge amount of computation. In-memory
clustering on retrieved documents is easier, and I think it is also
useful.

DSet, in the header file, stands for one cluster of documents and
MultiDSet stands for clusters of documents.

I am using a standalone similarity function
'calculate_doc_similarity()' which is overridable, so I don't use
Xapian's weighting schemes to calculate weights. (Partly because I
have not read through the Xapian source code yet.) The similarity
measure is based on the vector space model, and API users can simply
supply their own document similarity function. I am not sure if this
is an optimal design. Maybe putting the similarity function into a
class would be even better. It needs discussion.

Now, I am using MultiDSet to store documents. I am thinking if it
would be better if it returns multiple MSets, MultiMset, but the design
will be different and more complicated.

I have read the coding styles in HACKING, so I believe my coding style
would be OK. The issues would be on scalability and maintainability.

Comments are welcome.

Best,
Yung-chung Lin



  
Re: Document clustering module?
2007-09-16 10:17:01
[There's no need to Cc: me on list replies]

On Sun, Sep 16, 2007 at 08:26:05PM +0800, Yung-chung Lin wrote:
> The attached file is my current public clustering interface. I think
> it would be easier to have discussions with a header file present.

Good idea.

> DSet, in the header file, stands for one cluster of documents and
> MultiDSet stands for clusters of documents.

Returning a vector of vectors by value seems suboptimal.

Simply using typedef of a vector is problematic too - existing Xapian
classes are either reference counted handles, or have very few members,
so users can expect that copying them is cheap.
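
[Editor's sketch] To illustrate the point about cheap copies, a cluster result could be wrapped in a small handle that shares its data, so that copying the handle copies only a pointer. This is a minimal sketch under assumptions: `ClusterSet` and its members are hypothetical names, not part of the Xapian API, and `std::shared_ptr` merely stands in for Xapian's internal reference counting.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical handle class: NOT part of the Xapian API.
// std::shared_ptr stands in for Xapian's internal reference counting.
class ClusterSet {
  public:
    // One inner vector of document ids per cluster.
    typedef std::vector<std::vector<unsigned> > Data;

    ClusterSet() : internal(std::make_shared<Data>()) {}

    // The vector of vectors is moved into shared storage once.
    explicit ClusterSet(Data clusters)
        : internal(std::make_shared<Data>(std::move(clusters))) {}

    // Number of clusters.
    std::size_t size() const { return internal->size(); }

    // Document ids in cluster i (bounds-checked).
    const std::vector<unsigned>& operator[](std::size_t i) const {
        return internal->at(i);
    }

    // True if both handles refer to the same underlying data.
    bool shares_data_with(const ClusterSet& other) const {
        return internal == other.internal;
    }

  private:
    // Copying a ClusterSet copies only this pointer.
    std::shared_ptr<const Data> internal;
};
```

With this shape, returning a `ClusterSet` by value or storing copies of it never duplicates the underlying vectors.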

> I am using a standalone similarity function
> 'calculate_doc_similarity()' which is overridable.

Unfortunately, you can't usefully put virtual functions on classes which
use RefCntPtr - if you subclass, you're only subclassing the "pointer"
bit, so Xapian won't be able to call back to the overridden method.
Bug#186 is relevant (I had some further thoughts about how we could
address this but I don't have a full solution yet):

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=186

> Maybe putting the similarity function into a class would be even
> better. It needs discussion.

I think that is probably the answer.

> Now, I am using MultiDSet to store documents. I am thinking if it
> would be better if it returns multiple MSets, MultiMset, but the design
> will be different and more complicated.

I think I need to mull over how this would all be used.  Reusing MSet
would be nice if it's a good fit, since adding more API classes tends to
make it harder to learn the API, so it's good if it can be avoided.  But
forcing reuse where something isn't a natural fit would be worse.

> #include <xapian/base.h>
> #include <xapian/deprecated.h>
> #include <xapian/enquire.h>
> #include <xapian/types.h>
> #include <xapian/database.h>
> #include <xapian/document.h>
> #include <xapian/visibility.h>

You don't use deprecated.h here.

And I don't think you need database.h or enquire.h - you can just
forward declare "class Database;" and "class MSet;" inside the
namespace.
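
[Editor's sketch] The forward-declaration suggestion could look roughly like the header below. It is not the attached header from this thread: `ClusterSketch` and its members are hypothetical, and only the two forward declarations come from the advice above. A header that uses a type only through references or pointers needs nothing more than the declaration.

```cpp
// cluster_sketch.h - hypothetical header, not the file from the thread.
#ifndef CLUSTER_SKETCH_H
#define CLUSTER_SKETCH_H

namespace Xapian {

// Forward declarations: enough for references and pointers, so this
// header does not need <xapian/database.h> or <xapian/enquire.h>.
class Database;
class MSet;

// Hypothetical interface; only the declarations matter here.
class ClusterSketch {
  public:
    ClusterSketch() : db(0) {}
    explicit ClusterSketch(const Database* db_) : db(db_) {}

    // Declared only. The definition would live in a source file that
    // includes the full headers.
    void cluster(const MSet& matches);

    bool has_database() const { return db != 0; }

  private:
    const Database* db;  // a pointer to an incomplete type is fine
};

}

#endif
```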

Cheers,
    Olly


Re: Document clustering module?
2007-09-16 10:52:20
> > Maybe putting the similarity function into a class would be even
> > better. It needs discussion.
>
> I think that is probably the answer.

And what is your opinion of using Xapian::Weight to calculate document
similarity? I have not read through the code yet, but the weighting
classes seem heavy for this use.

> > Now, I am using MultiDSet to store documents. I am thinking if it
> > would be better if it returns multiple MSets, MultiMset, but the design
> > will be different and more complicated.
>
> I think I need to mull over how this would all be used.  Reusing MSet
> would be nice if it's a good fit, since adding more API classes tends to
> make it harder to learn the API, so it's good if it can be avoided.  But
> forcing reuse where something isn't a natural fit would be worse.

I just gave it some thought, and my simple and non-intrusive idea is to
specify the clustering algorithm when using Xapian::Enquire and to
associate each MSetItem with a cluster id, which would resemble:

  Enquire enq;
  ClusterSingleLinkage cluster_algorithm;
  enq.set_clustering_method(cluster_algorithm);
  MSet matches = enq.get_mset(1, 10);
  cout << matches.get_cluster_count() << endl;
  for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
      cout << "Document " << *miter << " is in cluster "
           << miter->get_cluster_id() << endl;
  }

And let API users do what they want to do with the clusters.

Best,
Yung-chung Lin


Re: Document clustering module?
2007-09-16 13:13:14
On Sun, Sep 16, 2007 at 11:52:20PM +0800, Yung-chung Lin wrote:
> And what is your opinion of using Xapian::Weight to calculate document
> similarity?

Xapian::Weight is set up to score a single document by adding scores
from a set of terms (plus an optional contribution which depends only on
the document length), whereas here we want a score from a pair of
documents.  So I think you'd have to convert one of the documents to a
list of all the terms in it, which seems artificial.

And it seems legitimate to allow clustering using document values (e.g.
you might store geographical coordinates in a document value and cluster
by location), which doesn't fit with Xapian::Weight.

So I think a class which provides a similarity measure given two
Xapian::Document objects is probably the answer.
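
[Editor's sketch] One possible shape for such a similarity class is shown below. Everything here is assumed for illustration: `Doc` is a stand-in for Xapian::Document so that the sketch compiles on its own, and `DocSimilarity`, `CosineSimilarity` and `LocationSimilarity` are hypothetical names. The first subclass follows the vector space model mentioned earlier in the thread; the second shows a measure based on stored values such as coordinates.

```cpp
#include <cmath>
#include <map>
#include <string>

// Stand-in for Xapian::Document, so the sketch compiles on its own.
struct Doc {
    std::map<std::string, double> terms;  // term -> weight
    double x, y;                          // e.g. coordinates kept in values
    Doc() : x(0), y(0) {}
};

// Hypothetical base class: scores a pair of documents.
class DocSimilarity {
  public:
    virtual ~DocSimilarity() {}
    // Larger means more similar.
    virtual double similarity(const Doc& a, const Doc& b) const = 0;
};

// Vector space model: cosine of the angle between the term vectors.
class CosineSimilarity : public DocSimilarity {
    typedef std::map<std::string, double>::const_iterator Iter;

    static double norm(const Doc& d) {
        double sum = 0;
        for (Iter i = d.terms.begin(); i != d.terms.end(); ++i)
            sum += i->second * i->second;
        return std::sqrt(sum);
    }

  public:
    double similarity(const Doc& a, const Doc& b) const {
        double dot = 0;
        for (Iter i = a.terms.begin(); i != a.terms.end(); ++i) {
            Iter j = b.terms.find(i->first);
            if (j != b.terms.end()) dot += i->second * j->second;
        }
        double denom = norm(a) * norm(b);
        // Documents with no terms are treated as dissimilar.
        return denom > 0 ? dot / denom : 0.0;
    }
};

// Value-based measure: closer coordinates score higher.
class LocationSimilarity : public DocSimilarity {
  public:
    double similarity(const Doc& a, const Doc& b) const {
        double dx = a.x - b.x, dy = a.y - b.y;
        return 1.0 / (1.0 + std::sqrt(dx * dx + dy * dy));
    }
};
```

A clustering algorithm would then take a `const DocSimilarity&`, so callers can swap measures without the algorithm knowing which one it was given.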

> > > Now, I am using MultiDSet to store documents. I am thinking if it
> > > would be better if it returns multiple MSets, MultiMset, but the design
> > > will be different and more complicated.
> >
> > I think I need to mull over how this would all be used.  Reusing MSet
> > would be nice if it's a good fit, since adding more API classes tends to
> > make it harder to learn the API, so it's good if it can be avoided.  But
> > forcing reuse where something isn't a natural fit would be worse.
>
> I just gave it some thought, and my simple and non-intrusive idea is to
> specify the clustering algorithm when using Xapian::Enquire and to
> associate each MSetItem with a cluster id, which would resemble:
>
>   Enquire enq;
>   ClusterSingleLinkage cluster_algorithm;
>   enq.set_clustering_method(cluster_algorithm);
>   MSet matches = enq.get_mset(1, 10);
>   cout << matches.get_cluster_count() << endl;
>   for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
>       cout << "Document " << *miter << " is in cluster "
>            << miter->get_cluster_id() << endl;
>   }
>
> And let API users do what they want to do with the clusters.

Yes, that seems a very nice approach.  It also more naturally allows the
possibility of using document similarity to eliminate near-duplicates -
to do that efficiently you want to do it as matches are generated so
that you can stop when you have enough in the MSet.
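
[Editor's sketch] The near-duplicate idea can be sketched as a filter over candidates arriving in rank order: keep a candidate only if it is not too similar to anything already kept, and stop as soon as enough have been kept. This is an illustration under assumptions, not how the Xapian matcher works: `TermSet`, `jaccard` and `dedup_top` are hypothetical names, and Jaccard overlap of term sets is just one convenient similarity measure.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

typedef std::set<std::string> TermSet;  // stand-in for a document's terms

// Jaccard similarity: size of the intersection / size of the union.
double jaccard(const TermSet& a, const TermSet& b) {
    if (a.empty() && b.empty()) return 1.0;
    std::size_t common = 0;
    for (TermSet::const_iterator i = a.begin(); i != a.end(); ++i)
        if (b.count(*i)) ++common;
    return double(common) / double(a.size() + b.size() - common);
}

// Walk candidates in rank order and keep the index of each one that is
// not a near-duplicate of something already kept. The loop stops as soon
// as `want` results have been kept, so later candidates are never scored.
std::vector<std::size_t> dedup_top(const std::vector<TermSet>& ranked,
                                   std::size_t want, double threshold) {
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < ranked.size() && kept.size() < want; ++i) {
        bool duplicate = false;
        for (std::size_t k = 0; k < kept.size() && !duplicate; ++k)
            duplicate = jaccard(ranked[i], ranked[kept[k]]) >= threshold;
        if (!duplicate) kept.push_back(i);
    }
    return kept;
}
```

Filtering after the full result set has been built would give the same kept items, but only after scoring every candidate, which is the inefficiency the paragraph above points at.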

It wouldn't allow generating of different clusters of the same results
(without rerunning the search) but that doesn't seem like it's likely to
be an annoying limitation.

Cheers,
    Olly


Re: Document clustering module?
2007-09-16 20:31:23
> > I just gave it some thought, and my simple and non-intrusive idea is to
> > specify the clustering algorithm when using Xapian::Enquire and to
> > associate each MSetItem with a cluster id, which would resemble:
> >
> >   Enquire enq;
> >   ClusterSingleLinkage cluster_algorithm;
> >   enq.set_clustering_method(cluster_algorithm);
> >   MSet matches = enq.get_mset(1, 10);
> >   cout << matches.get_cluster_count() << endl;
> >   for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
> >       cout << "Document " << *miter << " is in cluster "
> >            << miter->get_cluster_id() << endl;
> >   }
> >
> > And let API users do what they want to do with the clusters.
>
> Yes, that seems a very nice approach.  It also more naturally allows the
> possibility of using document similarity to eliminate near-duplicates -
> to do that efficiently you want to do it as matches are generated so
> that you can stop when you have enough in the MSet.
>
> It wouldn't allow generating of different clusters of the same results
> (without rerunning the search) but that doesn't seem like it's likely to
> be an annoying limitation.

Calling cluster_algorithm.cluster_mset(matches) manually could
re-cluster the matches, and you could also choose another clustering
algorithm. What about this?

Best,
Yung-chung Lin


Re: Document clustering module?
2007-09-17 05:38:25
Then I think the interface can become like this:

    // Cluster documents by document value 1
    matches.group_by_value(1);

    // Iterate through clusters and mset items in each cluster.
    for (ClusterIterator citer = matches.clusters_begin();
           citer != matches.clusters_end(); ++citer) {
        // get_cluster_id() returns the internal cluster index
        cout << citer->get_cluster_id() << endl;

        // Cluster's ID (or index) is just an unsigned integer.
        // Cluster's ID (or index) and mset item's index can be simply stored in
        // std::vector<std::vector> or std::vector<std::map>

        for (MSetIterator miter = citer->mset_begin();
               miter != citer->mset_end(); ++miter) {
            // Using miter->get_cluster_id() here returns the same.
            cout << "Doc " << *miter << " is in cluster "
                 << citer->get_cluster_id() << endl;
        }
    }

I believe cluster name can be added in the core easily if there is a
need. The access method can be like this:

    citer->set_cluster_name("some_mysterious_cluster");
    citer->get_cluster_name();

Best,
Yung-chung Lin

On 9/17/07, Richard Boulton <richardlemurconsulting.com> wrote:
> Olly Betts wrote:
> >>   Enquire enq;
> >>   ClusterSingleLinkage cluster_algorithm;
> >>   enq.set_clustering_method(cluster_algorithm);
> >>   MSet matches = enq.get_mset(1, 10);
> >>   cout << matches.get_cluster_count() << endl;
> >>   for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
> >>       cout << "Document " << *miter << " is in cluster "
> >>            << miter->get_cluster_id() << endl;
> >>   }
> >>
> >> And let API users do what they want to do with the clusters.
> >
> > Yes, that seems a very nice approach.  It also more naturally allows the
> > possibility of using document similarity to eliminate near-duplicates -
> > to do that efficiently you want to do it as matches are generated so
> > that you can stop when you have enough in the MSet.
> >
> > It wouldn't allow generating of different clusters of the same results
> > (without rerunning the search) but that doesn't seem like it's likely to
> > be an annoying limitation.
>
> We've also had the idea of extending the collapse mechanism to group by
> a value (instead of just returning the top document in a collapse group,
> as it currently does).  This kind of interface would allow that to be
> represented, too.
>
> There would need to be some way to get a list of the cluster ids
> allocated for a given mset, and probably also a way to get further
> information on a cluster - some clustering algorithms allow a name to be
> assigned to a cluster, so we should be able to provide that, (and if we
> were performing a "group by value" operation instead of a cluster, the
> value for each group should be available).
>
> --
> Richard
>


Re: Document clustering module?
2007-09-17 07:06:12
On Mon, Sep 17, 2007 at 09:31:23AM +0800, Yung-chung Lin wrote:
> Calling cluster_algorithm.cluster_mset(matches) manually could
> re-cluster the matches, and you could also choose another clustering
> algorithm. What about this?

I don't think it really matters.

Cheers,
    Olly


Re: Document clustering module?
2007-09-17 07:33:20
On Mon, Sep 17, 2007 at 06:38:25PM +0800, Yung-chung Lin wrote:
> Then I think the interface can become like this:
>
>     // Cluster documents by document value 1
>     matches.group_by_value(1);

If you're talking about grouping collapsed documents, that should
probably happen during the match process, like collapse does.  Don't
worry too much about that idea - let's focus on the clustering part
for now, and just bear in mind how it might be reused for this (or
perhaps this problem is too different).

If you're not talking about that, there needs to be a clustering
algorithm specified for this to work.

>     // Iterate through clusters and mset items in each cluster.
>     for (ClusterIterator citer = matches.clusters_begin();
>            citer != matches.clusters_end(); ++citer) {
>         // get_cluster_id() returns the internal cluster index
>         cout << citer->get_cluster_id() << endl;
>
>         // Cluster's ID (or index) is just an unsigned integer.
>         // Cluster's ID (or index) and mset item's index can be simply stored in
>         // std::vector<std::vector> or std::vector<std::map>
>
>         for (MSetIterator miter = citer->mset_begin();
>                miter != citer->mset_end(); ++miter) {
>             // Using miter->get_cluster_id() here returns the same.
>             cout << "Doc " << *miter << " is in cluster "
>                  << citer->get_cluster_id() << endl;
>         }
>     }

I wouldn't get too fancy initially - we don't want to produce an
elaborate API which we think does everything conceivable, only to
discover a better approach or something it can't nicely do, and then
have to choose between keeping the sub-optimal API we have, or the pain
of deprecation and transition.

Let's just go with tagging each MSet entry with a cluster id for now.
That seems a good starting point, and everything which has been
suggested so far can either be built on top of that, or provide that as
a side-effect.
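
[Editor's sketch] The "tag each entry with a cluster id" starting point can be illustrated as follows. The sketch assumes a precomputed pairwise similarity matrix and uses threshold-based single linkage (two items share a cluster when a chain of sufficiently similar pairs connects them), implemented with union-find. `tag_clusters` and `group_by_tag` are hypothetical names and nothing here is Xapian API. The second function shows how per-cluster iteration can be built on top of the tags.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Union-find root lookup with path halving.
static std::size_t find_root(std::vector<std::size_t>& parent, std::size_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

// Threshold single linkage: items i and j end up in one cluster if a chain
// of pairs with sim >= threshold connects them. sim must be square.
// Returns one cluster id per item, numbered from 0 in order of appearance.
std::vector<unsigned> tag_clusters(
        const std::vector<std::vector<double> >& sim, double threshold) {
    const std::size_t n = sim.size();
    std::vector<std::size_t> parent(n);
    for (std::size_t i = 0; i < n; ++i) parent[i] = i;

    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            if (sim[i][j] >= threshold) {
                std::size_t a = find_root(parent, i);
                std::size_t b = find_root(parent, j);
                parent[a] = b;  // merge the two clusters
            }
        }
    }

    // Map each root to a small consecutive id.
    std::map<std::size_t, unsigned> ids;
    std::vector<unsigned> tags(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t root = find_root(parent, i);
        std::map<std::size_t, unsigned>::iterator it = ids.find(root);
        if (it == ids.end())
            it = ids.insert(std::make_pair(root, unsigned(ids.size()))).first;
        tags[i] = it->second;
    }
    return tags;
}

// Built on top of the tags: the item indexes belonging to each cluster.
std::vector<std::vector<std::size_t> > group_by_tag(
        const std::vector<unsigned>& tags) {
    std::vector<std::vector<std::size_t> > groups;
    for (std::size_t i = 0; i < tags.size(); ++i) {
        if (tags[i] >= groups.size()) groups.resize(tags[i] + 1);
        groups[tags[i]].push_back(i);
    }
    return groups;
}
```

Because the tags are just one unsigned integer per entry, a cluster count, a cluster iterator, or named clusters could all be layered on later without changing what is stored per match.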

And that should allow us to get clustering functionality into a release
sooner.

Cheers,
    Olly

