Thread: Created: (LUCENE-626) Adaptive, user query session analyzing spell checker.




Commented: (LUCENE-626) Adaptive, user query session analyzing spell checker.
user name
2007-01-30 17:40:34
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468825 ]

Karl Wettin commented on LUCENE-626:
------------------------------------




I've been running this version live today. It suggests great stuff all
the time. It is, however, a bit of a RAM hog, just like everything else
I do. I think I'll add some sort of external persistence to handle that
(probably BDB), backed by a soft referenced cache.
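
A minimal sketch of what an external store fronted by a soft referenced
cache could look like. The Store interface and the SoftCache class are
hypothetical names used only for illustration; they are not part of the
patch, and a BDB-backed store would be one possible Store implementation.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical persistent store; a BDB-backed implementation could go here. */
interface Store<K, V> {
    V load(K key);
    void save(K key, V value);
}

/** Write-through cache whose values the GC may reclaim under memory pressure. */
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> cache = new HashMap<>();
    private final Store<K, V> store;

    SoftCache(Store<K, V> store) {
        this.store = store;
    }

    synchronized V get(K key) {
        SoftReference<V> ref = cache.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            // Never cached, or cleared by the garbage collector: reload from the store.
            value = store.load(key);
            if (value != null) {
                cache.put(key, new SoftReference<>(value));
            }
        }
        return value;
    }

    synchronized void put(K key, V value) {
        store.save(key, value); // persist first, so a cleared reference loses nothing
        cache.put(key, new SoftReference<>(value));
    }
}
```

Because every write goes to the store before it is cached, the garbage
collector can clear any entry without losing data; the cost is a reload
on the next read.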

There is a problem with the adaptive layer not adapting to (correct)
suggestions with a large edit distance supplied by the multi-word/term
position vector layer on top of the ngram spell checker. E.g. "magic
might heros" -> "heroes might magic".


> Adaptive, user query session analyzing spell checker.
> -----------------------------------------------------
>
>                 Key: LUCENE-626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-626
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: spellchecker.diff
>
>
> From javadocs:
> This is an adaptive, user query session analyzing spell checker. In
> plain words, a word and phrase dictionary that will learn from how
> users act while searching.
> Be aware, this is a beta version. It is not finished, but yields
> great results if you have enough user activity, RAM and a fairly
> narrow document corpus. The RAM problem can be fixed if you implement
> your own subclass of SpellChecker, as the abstract methods of this
> class are the CRUD methods. This will most probably change to a
> strategy class in a future version.
> TODO:
> 1. Gram up results to detect composite words that should not be
> composite words, and vice versa.
> 2. Train a grammed token (Markov) chain with output from an
> expectation maximization algorithm (Weka clusters?) parallel to a
> closest path (A* or breadth first?) to allow contextual suggestions
> on queries that were never placed.
> Usage:
> Training
> At user query time, create an instance of QueryResults containing the
> query string, number of hits and a time stamp. Add it to a
> chronologically ordered list in the user session (LinkedList makes
> sense) that you pass on to train(sessionQueries) as the session times
> out.
> You also want to call the bootstrap() method every 100000 queries or
> so.
> Spell checking
> Call getSuggestions(query) and look at the results. Don't modify it!
> This method call will be hidden in a facade in a future version.
> Note that the spell checker is case sensitive, so you want to clean
> up the query the same way when you train as when you request the
> suggestions.
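
The training and clean-up steps described in this usage section could be
sketched roughly as follows. The QueryResults class here is a stand-in
whose constructor and field names are assumptions, SessionLog is a
hypothetical helper, and collapsing runs of whitespace with "\\s+" is an
assumption about what the clean-up is meant to do.

```java
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for the patch's QueryResults; field names are assumed. */
class QueryResults {
    final String query;
    final int hits;
    final long timestamp;

    QueryResults(String query, int hits, long timestamp) {
        this.query = query;
        this.hits = hits;
        this.timestamp = timestamp;
    }
}

/** Collects one user's queries in chronological order until the session times out. */
class SessionLog {
    private final LinkedList<QueryResults> queries = new LinkedList<>();

    /** The same clean-up is applied when training and when requesting suggestions. */
    static String normalize(String query) {
        return query.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }

    void record(String rawQuery, int hits, long timestamp) {
        queries.add(new QueryResults(normalize(rawQuery), hits, timestamp));
    }

    /** On session timeout, this is the list that would be passed to train(sessionQueries). */
    List<QueryResults> sessionQueries() {
        return Collections.unmodifiableList(queries);
    }
}
```

Routing both training and suggestion requests through the same
normalize() call keeps the case-sensitive dictionary consistent.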
> I recommend something like query =
> query.toLowerCase().replaceAll(" ", " ").trim()

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Updated: (LUCENE-626) Adaptive, user query session analyzing spell checker.
user name
2007-02-02 19:47:06
     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Comment: was deleted



Updated: (LUCENE-626) Adaptive, user query session analyzing spell checker.
user name
2007-02-02 19:47:07
     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Comment: was deleted



Updated: (LUCENE-626) Adaptive, user query session analyzing spell checker.
user name
2007-02-02 19:47:06
     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Comment: was deleted



Updated: (LUCENE-626) Adaptive, user query session analyzing spell checker.
user name
2007-02-02 19:47:06
     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Comment: was deleted



Updated: (LUCENE-626) Adaptive, user query session analyzing spell checker.
user name
2007-02-02 19:47:06
     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Comment: was deleted



Updated: (LUCENE-626) Adaptive, user query session analyzing spell checker.
user name
2007-02-02 19:47:07
     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Comment: was deleted



Updated: (LUCENE-626) Extended spell checker with phrase support and adaptive user session an
user name
2007-02-02 20:19:05
     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

      Description: 
Some minor changes to how the single-token ngram spell checker in
contrib/spellcheck works, but nothing that breaks any old
implementation, I think. Also fixed the broken test.

NgramPhraseSuggester tokenizes a query and suggests combinations of the
single-token suggestions matrix.

They must match as a query against an a priori index. By using a span
near query (default) you get features like this:

    assertEquals("lost in translation", ngramSuggester.didYouMean("lost on translation"));

If term position vectors are available, it is possible to make it
context sensitive (or whatever one may call it) to suggest a new term
order.

    assertEquals("heroes might magic", ngramSuggester.didYouMean("magic light heros"));
    assertEquals("heroes of might and magic", ngramSuggester.didYouMean("heros on light and magik"));
    assertEquals("best game made", ngramSuggester.didYouMean("game best made"));
    assertEquals("game made", ngramSuggester.didYouMean("made game"));
    assertEquals("game made", ngramSuggester.didYouMean("made lame"));
    assertEquals("the game", ngramSuggester.didYouMean("the game"));
    assertEquals("in the fame", ngramSuggester.didYouMean("in the game"));
    assertEquals("game", ngramSuggester.didYouMean("same"));
    assertEquals(0, ngramSuggester.suggest("may game").size());
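
To illustrate the idea of combining the single-token suggestions matrix
and keeping only the combinations that match the index, here is a rough
sketch. PhraseCombiner is a hypothetical name, and the index check is
reduced to a Predicate stand-in; the patch itself is described as using
a span near query for that step, which is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class PhraseCombiner {
    /**
     * matrix.get(i) holds the candidate replacements for token i.
     * Returns every combination (one candidate per position) accepted by the index check.
     */
    static List<String> combine(List<List<String>> matrix, Predicate<List<String>> matchesIndex) {
        List<String> phrases = new ArrayList<>();
        expand(matrix, 0, new ArrayList<>(), matchesIndex, phrases);
        return phrases;
    }

    private static void expand(List<List<String>> matrix, int position, List<String> current,
                               Predicate<List<String>> matchesIndex, List<String> out) {
        if (position == matrix.size()) {
            // A full combination: keep it only if it matches the (stand-in) index.
            if (matchesIndex.test(current)) {
                out.add(String.join(" ", current));
            }
            return;
        }
        for (String candidate : matrix.get(position)) {
            current.add(candidate);
            expand(matrix, position + 1, current, matchesIndex, out);
            current.remove(current.size() - 1); // backtrack
        }
    }
}
```

The number of combinations grows multiplicatively with the number of
candidates per token, so a real implementation would likely cap the
suggestions per position.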

SessionAnalyzedDictionary is the adaptive layer that learns from how
users changed their queries, what data they inspected, etc. It will
automagically find and suggest synonyms, decomposed words, and probably
a lot of other neat features I still have not detected.

Depending a bit on the situation, ignored suggestions get suppressed and
followed suggestions get suggested even more.

    assertEquals("the da vinci code", dictionary.didYouMean("thedavincicode"));
    assertEquals("the da vinci code", dictionary.didYouMean("the davinci code"));

    assertEquals("homm", dictionary.didYouMean("heroes of might and magic"));
    assertEquals("heroes of might and magic", dictionary.didYouMean("homm"));

    assertEquals("heroes of might and magic 2", dictionary.didYouMean("heroes of might and magic ii"));
    assertEquals("heroes of might and magic ii", dictionary.didYouMean("heroes of might and magic 2"));
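
The suppress/reinforce behaviour described above can be pictured with a
toy scoring scheme. This is only a sketch: SuggestionFeedback is a
hypothetical class, and the plus-one/minus-one scoring with a zero
threshold is an assumption, not how SessionAnalyzedDictionary is
implemented.

```java
import java.util.HashMap;
import java.util.Map;

class SuggestionFeedback {
    // query -> (suggestion -> score); a higher score means users followed it more often
    private final Map<String, Map<String, Integer>> scores = new HashMap<>();

    /** The user was shown the suggestion and searched for it. */
    void followed(String query, String suggestion) {
        adjust(query, suggestion, +1);
    }

    /** The user was shown the suggestion and did not act on it. */
    void ignored(String query, String suggestion) {
        adjust(query, suggestion, -1);
    }

    private void adjust(String query, String suggestion, int delta) {
        scores.computeIfAbsent(query, q -> new HashMap<>()).merge(suggestion, delta, Integer::sum);
    }

    /** Best positively scored suggestion for the query, or null if none remains. */
    String didYouMean(String query) {
        Map<String, Integer> forQuery = scores.get(query);
        if (forQuery == null) {
            return null;
        }
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, Integer> e : forQuery.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```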


The adaptive layer is not yet(tm) persistent, but it is soft referenced
so that the dictionary doesn't eat up all your RAM.


  was:
From javadocs:

This is an adaptive, user query session analyzing spell checker. In
plain words, a word and phrase dictionary that will learn from how
users act while searching.

Be aware, this is a beta version. It is not finished, but yields great
results if you have enough user activity, RAM and a fairly narrow
document corpus. The RAM problem can be fixed if you implement your own
subclass of SpellChecker, as the abstract methods of this class are the
CRUD methods. This will most probably change to a strategy class in a
future version.

TODO:

1. Gram up results to detect composite words that should not be
composite words, and vice versa.

2. Train a grammed token (Markov) chain with output from an expectation
maximization algorithm (Weka clusters?) parallel to a closest path (A*
or breadth first?) to allow contextual suggestions on queries that were
never placed.

Usage:

Training

At user query time, create an instance of QueryResults containing the
query string, number of hits and a time stamp. Add it to a
chronologically ordered list in the user session (LinkedList makes
sense) that you pass on to train(sessionQueries) as the session times
out.

You also want to call the bootstrap() method every 100000 queries or so.

Spell checking

Call getSuggestions(query) and look at the results. Don't modify it!
This method call will be hidden in a facade in a future version.

Note that the spell checker is case sensitive, so you want to clean up
the query the same way when you train as when you request the
suggestions.

I recommend something like query =
query.toLowerCase().replaceAll(" ", " ").trim()

    Lucene Fields: [Patch Available]
         Assignee: Karl Wettin
       Issue Type: Improvement  (was: New Feature)
          Summary: Extended spell checker with phrase support and adaptive user session analysis.  (was: Adaptive, user query session analyzing spell checker.)

All of the old comments were obsolete, so I re-initialized the whole
issue.

> Extended spell checker with phrase support and adaptive user session analysis.
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-626
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Karl Wettin
>         Assigned To: Karl Wettin
>            Priority: Minor
>         Attachments: spellchecker.diff
>
>


Updated: (LUCENE-626) Extended spell checker with phrase support and adaptive user session an
user name
2007-02-02 20:23:05
     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Attachment:     (was: spellchecker.diff)



Updated: (LUCENE-626) Extended spell checker with phrase support and adaptive user session an
user name
2007-02-02 20:23:05
     [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Attachment: spellchecker.diff

NgramPhraseSuggester is now decoupled from the adaptive
layer, but I would like to refactor it even more so it is
easy to replace the SpellChecker with any other single token
suggester.
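
One way the further decoupling could look is a small interface that any
single-token suggester implements. This is a sketch under assumptions:
TokenSuggester and SuggestionMatrixBuilder are hypothetical names, not
classes from the patch. An adapter around the contrib SpellChecker's
suggestSimilar(word, numSug) method would be one possible
implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over any single-token suggester. */
interface TokenSuggester {
    List<String> suggest(String token, int maxSuggestions);
}

/** Builds the per-token suggestion matrix without knowing which suggester is behind it. */
class SuggestionMatrixBuilder {
    private final TokenSuggester suggester;

    SuggestionMatrixBuilder(TokenSuggester suggester) {
        this.suggester = suggester;
    }

    List<List<String>> build(String query, int maxPerToken) {
        List<List<String>> matrix = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            List<String> row = new ArrayList<>();
            row.add(token); // the original token is always a candidate
            for (String s : suggester.suggest(token, maxPerToken)) {
                if (!row.contains(s)) {
                    row.add(s);
                }
            }
            matrix.add(row);
        }
        return matrix;
    }
}
```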


