Thread: Re: default text type and stop words




Re: default text type and stop words
user name
2007-11-05 23:59:47
: This isn't a problem in Lucene or Solr. It is a result of the analyzers
: you have chosen to use. If you choose to remove stopwords, you will not
: be able to match stopwords.

I believe Paul's point was that this use of stopwords is in the "text"
fieldtype in the example schema.xml ... which many people use as is.

I'm personally of the mindset that it's fine like it is. While people who
understand that "an" is a stop word might ask "why does 'rating:PG AND
name:an' match 40K movies, it should match 0?" there is another (probably
larger) group of people who won't know how the search is implemented, or
that "an" is a stop word, and they will look at the same results and ask
"why am I getting 40K results? most of these don't have 'an' in the title?
I should only be getting X results."

That second group of people aren't going to be any happier if you
give them 0 results instead -- at least this way people get some results
to work with.
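
To make the 'rating:PG AND name:an' case concrete, here is a toy sketch in
Python. It is not Lucene or Solr code: the stop list, the movie data, and the
analyze/search helpers are all made up for illustration, and the rule that a
name clause with no surviving terms constrains nothing is an assumption
chosen to match the behaviour described above.

```python
# Toy model of stop word removal at index and query time.
# Everything here (stop list, data, helpers) is hypothetical.

STOPWORDS = {"a", "an", "the", "of"}

MOVIES = [
    ("An American Tail", "PG"),
    ("Shrek", "PG"),
    ("The Matrix", "R"),
    ("Babe", "PG"),
]


def analyze(text, stopwords):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [t for t in text.lower().split() if t not in stopwords]


def search(movies, rating, name_query, stopwords):
    """Return titles with the given rating whose name has every query term.

    If analysis removes every query term, the name clause has nothing
    left to require, so only the rating clause filters the results.
    """
    terms = analyze(name_query, stopwords)
    hits = []
    for title, movie_rating in movies:
        if movie_rating != rating:
            continue
        title_terms = set(analyze(title, stopwords))
        if all(t in title_terms for t in terms):
            hits.append(title)
    return hits


# With stop word removal, "an" vanishes and every PG movie matches.
with_stop = search(MOVIES, "PG", "an", STOPWORDS)

# With no stop list, "an" is an ordinary term and must be in the title.
without_stop = search(MOVIES, "PG", "an", set())

print(with_stop)
print(without_stop)
```

With the stop list in place the query returns every PG title; with an empty
stop list it returns only the title that actually contains "an". Which of
those is the better default is the judgment call being discussed here.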


-Hoss


Re: default text type and stop words
user name
2007-11-06 00:36:23
I also said, "Stopword removal is a reasonable default because it works
fairly well for a general text corpus." Ultraseek keeps stopwords but
most engines don't. I think it is fine as a default. I also think you
have to understand stopwords at some point.

wunder

On 11/5/07 9:59 PM, "Chris Hostetter" <hossman_lucenefucit.org> wrote:

>
> : This isn't a problem in Lucene or Solr. It is a result of the analyzers
> : you have chosen to use. If you choose to remove stopwords, you will not
> : be able to match stopwords.
>
> I believe Paul's point was that this use of stopwords is in the "text"
> fieldtype in the example schema.xml ... which many people use as is.
>
> I'm personally of the mindset that it's fine like it is. While people who
> understand that "an" is a stop word might ask "why does 'rating:PG AND
> name:an' match 40K movies, it should match 0?" there is another (probably
> larger) group of people who won't know how the search is implemented, or
> that "an" is a stop word, and they will look at the same results and ask
> "why am I getting 40K results? most of these don't have 'an' in the title?
> I should only be getting X results."
>
> That second group of people aren't going to be any happier if you
> give them 0 results instead -- at least this way people get some results
> to work with.
>
> -Hoss


