Thread: Created: (LUCENE-1025) Document clusterer
|
|
| Created: (LUCENE-1025) Document clusterer | United States | 2007-10-09 20:44:50 |
Document clusterer
------------------
Key: LUCENE-1025
URL: https://issues.apache.org/jira/browse/LUCENE-1025
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis, Term Vectors
Reporter: Karl Wettin
Priority: Minor
Attachments: LUCENE-1025.txt
A two-dimensional decision tree in conjunction with cosine
coefficient similarity is the basis of this document
clusterer. It uses Lucene for tokenization and length
normalization.
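
For readers unfamiliar with the cosine coefficient, here is a minimal sketch of it over sparse term-count vectors. It is not taken from the patch: the whitespace tokenizer is only a stand-in for Lucene analysis, and the names `term_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Lowercase, split on whitespace, and count terms (stand-in for Lucene analysis)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine coefficient of two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    # Dividing by both norms is the length normalization step.
    return dot / (norm_a * norm_b)


doc1 = term_vector("lucene index search")
doc2 = term_vector("lucene index merge")
doc3 = term_vector("weather report today")

print(cosine_similarity(doc1, doc2))  # two of three terms shared
print(cosine_similarity(doc1, doc3))  # no terms shared
```

Dividing by the vector norms means a long and a short article on the same topic can still score as similar.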
Example output of 3500 clustered news articles dated the
first three days of January 2004, from a number of sources,
can be found here: <http://ginandtonique.org/~kalle/LUCENE-1025/out_4.0.txt>.
One thing missing is automatic calculation of cluster
boundaries. It is not impossible to implement, nor is it
really needed. The number in the file name of the URL above
is that distance.
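
To illustrate what a manually chosen boundary distance does, here is a rough sketch of greedy "leader" clustering with a fixed cutoff. This is a swapped-in, simpler technique than the tree in the patch, and the Euclidean distance, the sample points, and the name `cluster_by_cutoff` are all assumptions made for the example.

```python
import math


def cluster_by_cutoff(points, cutoff):
    """Greedy leader clustering: join the first cluster whose leader is within cutoff."""
    clusters = []  # each cluster is a list whose first element is its leader
    for p in points:
        for cluster in clusters:
            if math.dist(p, cluster[0]) <= cutoff:
                cluster.append(p)
                break
        else:
            # No existing leader is close enough, so p starts a new cluster.
            clusters.append([p])
    return clusters


points = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0), (11.0, 10.0), (0.5, 0.0)]
tight = cluster_by_cutoff(points, cutoff=1.0)
loose = cluster_by_cutoff(points, cutoff=4.0)

print(len(tight), len(loose))
```

A larger cutoff merges more documents into fewer clusters, which is why the best value depends on the corpus and on the distance measure.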
The example was calculated with each instance limited to its
top 1000 terms, divided with siblings and re-pruned all the
way to the root. On my dual-core machine it took about 100 ms
to insert a new document in the tree, no matter whether it
contained 100 or 10,000 instances. 1 GB of RAM held about
10,000 news articles.
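
As a sketch of the pruning idea only: the snippet below keeps the highest-weighted terms of a sparse vector, and shows one possible reading of re-pruning on the way to the root (sum the child vectors, then prune again). The function names and the summing step are assumptions, not a description of the patch.

```python
from collections import Counter


def prune_to_top_terms(term_weights, limit):
    """Keep only the `limit` highest-weighted terms of a sparse vector."""
    return dict(Counter(term_weights).most_common(limit))


def merge_and_prune(children, limit):
    """Assumed parent-node step: sum the child vectors, then prune again."""
    total = Counter()
    for child in children:
        total.update(child)  # Counter.update adds counts term by term
    return prune_to_top_terms(total, limit)


vector = {"lucene": 5, "index": 3, "the": 1, "segment": 4, "a": 1}
pruned = prune_to_top_terms(vector, limit=3)
parent = merge_and_prune([{"lucene": 5, "index": 3}, {"lucene": 2, "merge": 4}], limit=2)

print(pruned)
print(parent)
```

Capping the vector size bounds the cost of each similarity computation on the path from a leaf to the root.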
The next step for this code is persistence of the tree using
BDB, or perhaps even something similar to Lucene's segmented
solution, perhaps even using a Lucene Directory. The
plan is to keep this clusterer synchronized with the index,
allowing really speedy "more like this" features.
Later on I'll introduce map/reduce for better training
speed.
This code is far from perfect, nor are the results as good as
those of many other products. Considering that I didn't put in
more than a handful of hours, it works quite well.
By displaying neighboring clusters (as in the example) one
will definitely get more related documents at a fairly low
false-positive cost. Perhaps it would be interesting to
analyse user behavior to find out whether any of the clusters
could be merged. Perhaps some reinforcement learning?
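
One way to picture "neighboring clusters" is to rank the other clusters by the similarity of their centroids. The sketch below assumes each cluster is summarized by a sparse centroid vector; the cluster names, the weights, and `nearest_clusters` are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine coefficient of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_clusters(name, centroids, k):
    """Return the k other clusters whose centroids are most similar to `name`."""
    target = centroids[name]
    scored = [(cosine(target, c), other)
              for other, c in centroids.items() if other != name]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


centroids = {
    "elections": {"vote": 3.0, "party": 2.0},
    "parliament": {"party": 3.0, "bill": 2.0},
    "football": {"goal": 4.0, "match": 1.0},
}
neighbors = nearest_clusters("elections", centroids, k=1)

print(neighbors)
```

Clusters that keep showing up as each other's nearest neighbors would be natural candidates for merging.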
There are no ROC curves, precision/recall values, or
tp/fp rates, as I have no manually clustered corpus to
compare with.
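
If a manually clustered corpus were available, pairwise precision and recall would be one common way to score the output. This is a minimal sketch with invented document ids and labels; no such evaluation was run on the patch.

```python
from itertools import combinations


def pairwise_precision_recall(predicted, gold):
    """Score a clustering by checking every pair of documents against gold labels."""
    tp = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = gold[a] == gold[b]
        if same_pred and same_gold:
            tp += 1  # correctly placed together
        elif same_pred:
            fp += 1  # placed together, but should be apart
        elif same_gold:
            fn += 1  # placed apart, but should be together
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical data: cluster ids from the clusterer and labels from a human.
predicted = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
gold = {"d1": "a", "d2": "a", "d3": "b", "d4": "b"}
precision, recall = pairwise_precision_recall(predicted, gold)

print(precision, recall)
```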
I've been looking for an archive of the Lucene-users forum
for demonstration use, but could not find one. Any ideas on
where I can find that? It could, for instance, be neat to
tweak this code to identify frequently asked questions and
match them with answers in the Wiki, but perhaps an SVM, NB,
or some other implementation would be better suited for
that.
Don't hesitate to comment on this if you have an idea,
request or question.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
For additional commands, e-mail: java-dev-help lucene.apache.org
|
|
| Updated: (LUCENE-1025) Document clusterer | United States | 2007-10-10 04:20:51 |
[ https://issues.apache.org/jira/browse/LUCENE-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wettin updated LUCENE-1025:
--------------------------------
Attachment: LUCENE-1025.txt
A major bug in the Markov Chain Token Filter was detected in
the port from Java 1.5 to 1.4.
The clustering output should be noticeably better now. I'll
update the linked example output.
|
|
| Updated: (LUCENE-1025) Document clusterer | United States | 2007-10-10 04:20:51 |
[ https://issues.apache.org/jira/browse/LUCENE-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wettin updated LUCENE-1025:
--------------------------------
Attachment: (was: LUCENE-1025.txt)
|
|
| Commented: (LUCENE-1025) Document clusterer | United States | 2007-10-10 05:00:56 |
[ https://issues.apache.org/jira/browse/LUCENE-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533689 ]
Karl Wettin commented on LUCENE-1025:
-------------------------------------
There is now new example output here: http://ginandtonique.org/~kalle/LUCENE-1025/
I recommend out_5.5.txt, but which number best demonstrates
the clusterer will change as the tokenization and similarity
algorithm change.
|
|
| Commented: (LUCENE-1025) Document clusterer | United States | 2007-10-11 15:20:50 |
[ https://issues.apache.org/jira/browse/LUCENE-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534145 ]
Karl Wettin commented on LUCENE-1025:
-------------------------------------
> I've been looking for an archive of the Lucene-users forum for
> demonstrational use, but could not find it. Any ideas on where I can find that?
Available here: http://lucene.apache.org/mail
|
|
| Updated: (LUCENE-1025) Document clusterer | United States | 2007-10-14 21:56:51 |
[ https://issues.apache.org/jira/browse/LUCENE-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wettin updated LUCENE-1025:
--------------------------------
Attachment: LUCENE-1025.txt
Introduced a half-baked Markov chain (not to be confused with
the token filter in this patch with a similar name, which
concatenates tokens) in order to determine a mean title
from all instances in a cluster. It works sort of like
MegaHAL, the talking bot, but usually makes more sense, as
all titles in a cluster are similar. It needs work on
limiting length, it should look further ahead than one link,
and it is terribly unoptimized. Still, it already behaves
quite well with the news article corpus I test with.
Perhaps it would make sense to instead select the title of
the most central instance in a cluster, but that is not as
fun.
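
To make the idea concrete, here is a rough sketch of a first-order Markov chain over the titles in one cluster. It is an assumption-laden illustration rather than the code in the patch: it always follows the most frequent next word instead of sampling, caps the length with a hypothetical `max_words` parameter, and uses invented sample titles.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # sentinel tokens marking title boundaries


def build_chain(titles):
    """Count word-to-next-word transitions over all titles in a cluster."""
    chain = defaultdict(Counter)
    for title in titles:
        words = [START] + title.lower().split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current][following] += 1
    return chain


def mean_title(chain, max_words=12):
    """Follow the most frequent transition from START until END or the length cap."""
    words, current = [], START
    while len(words) < max_words:
        following = chain[current].most_common(1)[0][0]
        if following == END:
            break
        words.append(following)
        current = following
    return " ".join(words)


titles = [
    "storm hits northern coast",
    "storm hits coast overnight",
    "heavy storm hits northern coast",
]
title = mean_title(build_chain(titles))

print(title)
```

Because only one link is considered at a time, the chain can loop or wander on less uniform titles, which is why the length cap matters.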
|