It depends on the order of the filters in your Analyzer. You want to be
sure you put the StopWord filter before the Stemming filter. The reason
the MoreLikeThis class does not do what you want is that it first
applies the Analyzer (which stems) and THEN applies its own stop word
removal.

If you pass an Analyzer that removes stop words before stemming, you
don't have to worry about the stemming at all. The stopword
'uninteresting' would be removed before the stemming even occurred in
the analyzer. The tokens from the analyzer would then be fed to the
MoreLikeThis stop word removal scheme, but you could just leave that
list empty, as it's too late by then anyway: you would have already done
your stop word removal with the Analyzer rather than with the
MoreLikeThis scheme, which can only run after an Analyzer has been fully
applied to the text.

Frankly, I don't know why MoreLikeThis supports its own stopword list.
You can always do it in a custom analyzer that you pass to MoreLikeThis,
which gives you more control over when the stopword removal is applied
(say, before or after stemming). Sugar, I guess.
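
To make the ordering point concrete, here is a minimal plain-Java sketch.
It deliberately does not use the Lucene API: toyStem is a made-up
stand-in for a real stemmer such as Snowball, and the class and method
names are hypothetical, chosen only for illustration. It compares
stop-then-stem with stem-then-stop (the order MoreLikeThis ends up with,
as described above), and also tries the idea from the original question
of running the stop list itself through the stemmer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FilterOrderSketch {

    // Hypothetical toy stemmer (NOT Snowball): strips one of a few suffixes.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "er", "ed"}) {
            if (token.endsWith(suffix) && token.length() > suffix.length() + 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Stop word removal first, stemming second.
    static List<String> stopThenStem(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(toyStem(t));
            }
        }
        return out;
    }

    // Stemming first, exact-match stop word removal second.
    static List<String> stemThenStop(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String stemmed = toyStem(t);
            if (!stopWords.contains(stemmed)) {
                out.add(stemmed);
            }
        }
        return out;
    }

    // Run the stop list through the same stemmer as the text.
    static Set<String> stemStopWords(Set<String> stopWords) {
        Set<String> out = new HashSet<>();
        for (String w : stopWords) {
            out.add(toyStem(w));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("developer", "search", "developed");
        Set<String> stop = Set.of("developer");

        System.out.println(stopThenStem(tokens, stop));                // [search, develop]
        System.out.println(stemThenStop(tokens, stop));                // [develop, search, develop]
        System.out.println(stemThenStop(tokens, stemStopWords(stop))); // [search]
    }
}
```

Note that with the stop filter first, only the exact word on the list is
dropped; in this toy run "developed" still comes through as "develop".
Only the stemmed stop list removes every variant here.
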
- Mark
Donna L Gresh wrote:
> I wasn't sure this:
>
>> Instead add the stopwords to the analyzer that
>> you pass to MoreLikeThis. That way you can ensure that the analyzer
>> applies the stopword list before stemming
>
> would work, because I don't want to provide all the variants of the
> stopword list-- if I do this, only the one provided will be removed,
> correct?
>
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> gresh us.ibm.com
>
>
> Mark Miller <markrmiller gmail.com> wrote on 10/15/2007 10:37:22 AM:
>
>
>> Sounds right to me.
>>
>> The other option I think you have is to not use the MoreLikeThis
>> stopword functionality. Instead add the stopwords to the analyzer that
>> you pass to MoreLikeThis. That way you can ensure that the analyzer
>> applies the stopword list before stemming (The MoreLikeThis stopword
>> removal is implemented so that stopwords are removed after stemming).
>> Then you just have to add 'developer' to the stop list, and you can
>> forget about handling stemmed forms.
>>
>> Your method should also work though.
>>
>> - Mark
>>
>> Donna L Gresh wrote:
>>
>>> Could those "in the know" comment on my current understanding of
>>> stemming and stopwords using the snowball analyzer?
>>>
>>> In my application, I am using the MoreLikeThis class to find similar
>>> documents to an input "text blob". There are words in the input text
>>> blob which are "uninteresting" for my application, so I create a list
>>> of these words. These words are "uninteresting" no matter what their
>>> tense or usage, for example, "develop", "developing", "developed", and
>>> "developer" are all uninteresting and I do not want them included in
>>> the search query created by the MoreLikeThis class.
>>>
>>> My index documents are stemmed using the Snowball analyzer. I do not
>>> use any stopwords when the documents are indexed (as I would like the
>>> choice of stopwords to be under user control at search time).
>>>
>>> I would like the user to be able to provide to the search application
>>> a list of "uninteresting" words, and for obvious reasons would like to
>>> force them to provide only, say, "developer" and have the application
>>> understand that all variants should be ignored (and I don't want to
>>> force them to try to guess what the stemmed version of "developer"
>>> is).
>>>
>>> My first try was to use MoreLikeThis with the Snowball analyzer and a
>>> simple list of unstemmed stopwords (MoreLikeThis.setAnalyzer and
>>> MoreLikeThis.setStopWords). However, it appears that the stopwords
>>> provided to the MoreLikeThis class are compared in an exact way to the
>>> token stream output by the Snowball filter (where the words have been
>>> stemmed), so "developer" will not match anything, and all variants
>>> pass through. Even if I provide the list of unstemmed stopwords to the
>>> snowball analyzer instead, they are used "as-is" with no stemming
>>> performed, so "developer" will not remove "developed".
>>>
>>> Apparently the following is necessary for my application:
>>> Construct a snowball analyzer with no stopwords. Use the unstemmed
>>> stopword list with the analyzer to construct a stemmed version of the
>>> set of stopwords. Use this set of stemmed stopwords as the stopwords
>>> input to the MoreLikeThis class (where the tokens are compared to the
>>> stemmed versions after being output from the Snowball analyzer).
>>>
>>> Is my understanding correct?
>>>
>>> Donna
>>>
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe lucene.apache.org
>> For additional commands, e-mail: java-user-help lucene.apache.org
>>
>>