Thread: Re: Document clustering module?




Re: Document clustering module?
2007-09-17 04:32:08
Olly Betts wrote:
>>   Enquire enq;
>>   ClusterSingleLinkage cluster_algorithm;
>>   enq.set_clustering_method(cluster_algorithm);
>>   MSet matches = enq.get_mset(1, 10);
>>   cout << matches.get_cluster_count() << endl;
>>   for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
>>       cout << "Document " << *miter << " is in cluster "
>>            << miter->get_cluster_id() << endl;
>>   }
>>
>> And let API users do what they want to do with the clusters.
> 
> Yes, that seems a very nice approach.  It also more naturally allows
> the possibility of using document similarity to eliminate
> near-duplicates - to do that efficiently you want to do it as matches
> are generated so that you can stop when you have enough in the MSet.
>
> It wouldn't allow generating of different clusters of the same results
> (without rerunning the search) but that doesn't seem like it's likely
> to be an annoying limitation.

We've also had the idea of extending the collapse mechanism to group by
a value (instead of just returning the top document in a collapse group,
as it currently does).  This kind of interface would allow that to be
represented, too.
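
To make the difference concrete, here is a minimal standalone sketch of the two behaviours. It does not use Xapian at all: `Match`, `collapse_top` and `group_by_value` are hypothetical names invented for illustration, and the input is assumed to be sorted best-first.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one ranked match; not a Xapian type.
struct Match {
    unsigned docid;
    std::string value;  // the value the collapse/group operation keys on
};

// Collapse as described above: keep only the top document per value.
// `ranked` is assumed to be sorted best-first.
std::vector<Match> collapse_top(const std::vector<Match>& ranked) {
    std::set<std::string> seen;
    std::vector<Match> kept;
    for (const Match& m : ranked) {
        // insert().second is true only the first time a value is seen.
        if (seen.insert(m.value).second) kept.push_back(m);
    }
    assert(kept.size() == seen.size());  // exactly one survivor per value
    return kept;
}

// Group by value: keep every document, bucketed by its value.
// Rank order is preserved inside each bucket.
std::map<std::string, std::vector<unsigned> >
group_by_value(const std::vector<Match>& ranked) {
    std::map<std::string, std::vector<unsigned> > groups;
    for (const Match& m : ranked) groups[m.value].push_back(m.docid);
    return groups;
}
```

With a cluster id attached to each match, the second form falls out naturally: the "cluster" is simply the group the document's value puts it in.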

There would need to be some way to get a list of the cluster ids
allocated for a given mset, and probably also a way to get further
information on a cluster.  Some clustering algorithms allow a name to be
assigned to a cluster, so we should be able to provide that (and if we
were performing a "group by value" operation instead of clustering, the
value for each group should be available).
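
As a rough sketch of what such accessors might look like, the following standalone code keeps a table of per-cluster information. Everything in it (`ClusterId`, `ClusterInfo`, `ClusterTable` and its methods) is a hypothetical illustration, not existing or agreed Xapian API.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only; none of this is Xapian API.
typedef unsigned ClusterId;

struct ClusterInfo {
    std::string name;   // label assigned by the algorithm, may be empty
    std::string value;  // shared value, for a "group by value" operation
};

class ClusterTable {
    std::map<ClusterId, ClusterInfo> info_;

  public:
    // Register a cluster allocated while the matches were generated.
    void add_cluster(ClusterId id, const ClusterInfo& info) {
        assert(info_.count(id) == 0);  // each id is allocated only once
        info_[id] = info;
    }

    std::size_t get_cluster_count() const { return info_.size(); }

    // List of the cluster ids allocated, in ascending id order.
    std::vector<ClusterId> get_cluster_ids() const {
        std::vector<ClusterId> ids;
        for (const auto& entry : info_) ids.push_back(entry.first);
        return ids;
    }

    // Further information on one cluster; throws std::out_of_range
    // if the id was never allocated.
    const ClusterInfo& get_cluster_info(ClusterId id) const {
        return info_.at(id);
    }
};
```

In a real design this information would presumably hang off the MSet itself; a separate table is used here only to keep the sketch self-contained.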

-- 
Richard

_______________________________________________
Xapian-devel mailing list
Xapian-devel@lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel

Re: Document clustering module?
2007-09-17 04:37:54
Richard Boulton wrote:
> There would need to be some way to get a list of the cluster ids
> allocated for a given mset

I meant to say - it would be useful if there was a way to get this; you
could always iterate through the whole mset to get this list, of course.
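
The iteration fallback could look something like the sketch below. It uses a plain vector in place of an MSet; `MatchItem` and `collect_cluster_ids` are hypothetical names, and the per-match cluster id is assumed to be available as in the interface proposed earlier in the thread.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for one MSet entry; not a Xapian type.
struct MatchItem {
    unsigned docid;
    unsigned cluster_id;
};

// Walk the whole result list once and return each distinct cluster id,
// in the order the ids are first encountered (i.e. rank order).
std::vector<unsigned> collect_cluster_ids(const std::vector<MatchItem>& mset) {
    std::set<unsigned> seen;
    std::vector<unsigned> ids;
    for (const MatchItem& m : mset) {
        // insert().second is true only for an id not seen before.
        if (seen.insert(m.cluster_id).second) ids.push_back(m.cluster_id);
    }
    assert(ids.size() == seen.size());  // no id is reported twice
    return ids;
}
```

This costs one pass over the matches plus a set lookup per match, which is why a direct accessor would be a convenience rather than a necessity.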

-- 
Richard

