Thread: common words not stop words?? how to ??




common words not stop words?? how to ??
user name
2007-02-19 03:22:38
Hi:

I was wondering how you are dealing with "common words". By common
words I mean the ones that fall outside the "stop words" category. Of
course, "stop words" is subjective, i.e. it is up to the implementor.
What I would like to know is how to increase or decrease the boost
value based on such "common words". Should I have a field
"Common_Words_Plus" and a field "Common_Words_Minus"? Plus for words
that need to be boosted up and minus for words that get boosted down?
No?

The above sounds like a not-so-professional quick fix. Does anyone
have a better solution? How are you dealing with this?

Regards

Re: common words not stop words?? how to ??
user name
2007-02-19 10:32:29
Lucene/Solr does this automatically. That is how a tf.idf
engine works: it boosts rare words.

Do you have examples of problems or are you worrying about
something that might happen?

wunder
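
To illustrate the point about rare words, here is a minimal sketch of inverse document frequency. It uses the plain textbook formula log(N / df), which is an assumption for illustration and not necessarily the exact formula Lucene/Solr applies; the toy titles and the `idf` helper are hypothetical.

```python
import math

# Hypothetical toy corpus: each document is one short title.
docs = [
    "how good is person x",
    "how bad is product y",
    "how cool is the movie z",
    "the next big virus",
]

def idf(term, docs):
    """Textbook inverse document frequency: log(N / df)."""
    # df = number of documents that contain the term at least once.
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df) if df else 0.0

# "how" occurs in 3 of the 4 titles, "virus" in only 1,
# so "virus" gets the larger weight.
common = idf("how", docs)
rare = idf("virus", docs)
```

With this weighting, a word that shows up in most documents contributes little to a match, while a word that shows up in few documents contributes a lot, without any hand-maintained word list.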

On 2/19/07 1:22 AM, "rubdabadub" <rubdabadubgmail.com> wrote:

> Hi:
>
> I was wondering how you are dealing with "common words". By common
> words I mean the ones that fall outside the "stop words" category. Of
> course, "stop words" is subjective, i.e. it is up to the implementor.
> What I would like to know is how to increase or decrease the boost
> value based on such "common words". Should I have a field
> "Common_Words_Plus" and a field "Common_Words_Minus"? Plus for words
> that need to be boosted up and minus for words that get boosted down?
> No?
>
> The above sounds like a not-so-professional quick fix. Does anyone
> have a better solution? How are you dealing with this?
>
> Regards


Re: common words not stop words?? how to ??
user name
2007-02-19 13:28:20
Walter:

Thanks for the feedback.

On 2/19/07, Walter Underwood <wunderwoodnetflix.com> wrote:
> Lucene/Solr does this automatically. That is how a tf.idf
> engine works, it boosts rare words.
>
> Do you have examples of problems or are you worrying about
> something that might happen?

Actually, my use case is the following. Say, hypothetically, you have
a field holding sentence-long titles. If you read those titles you can
group most of them into 5 subject areas. A hypothetical example
(125 titles in total; 100 fall into the groups below and 25 cannot be
grouped):

22 titles are about: How good is Person X
14 titles are about: How bad is Product Y
10 titles are about: bond weather
36 titles are about: How cool is the movie Z
18 titles are about: The next big MS virus

What I am trying to achieve is this:

I would like to weed out "bond weather" as a group because it is not
interesting in my use case. Let's say it is noise, not signal. So I
thought I could use some "common words" for that. Furthermore, I was
thinking that with common words I could boost certain fields, i.e. if
Person X is a known person, for example a "prime minister" or a
"movie star", then having a certain word attached to another known
word means it is important. Maybe I defined my problem wrongly. I hope
the above gives you an overview.

Regards
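
The boost-up / boost-down idea described in this message could be sketched roughly as follows. This is only an illustration of the concept in plain Python, not Solr or Lucene configuration: the term lists, the multipliers, and the `adjust_score` helper are all hypothetical.

```python
# Hypothetical term lists; the names mirror the "Common_Words_Plus" and
# "Common_Words_Minus" fields proposed earlier in the thread.
PLUS_TERMS = {"prime", "minister"}
MINUS_TERMS = {"bond", "weather"}

def adjust_score(title, base_score, plus=1.5, minus=0.25):
    """Scale base_score once for each distinct listed term in the title."""
    score = base_score
    for token in set(title.lower().split()):
        if token in PLUS_TERMS:
            score *= plus    # boost titles mentioning "signal" terms
        elif token in MINUS_TERMS:
            score *= minus   # demote titles mentioning "noise" terms
    return score

titles = [
    "Bond weather update",
    "How good is the Prime Minister",
    "The next big MS virus",
]

# Re-rank the titles, assuming every title starts from the same base score.
ranked = sorted(titles, key=lambda t: adjust_score(t, 1.0), reverse=True)
```

In this sketch the "bond weather" title sinks to the bottom and the "Prime Minister" title rises to the top, while titles with no listed terms keep their original score. Whether such lists are worth maintaining by hand depends on how stable the noise and signal terms are.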
