Thread: multilingual list of stopwords




multilingual list of stopwords
2007-10-17 22:18:51
Hi,

I am looking for a multilingual list of stopwords to use with
Solr/Lucene and would greatly appreciate advice on where I could
find it.

Thanks,

Maria

Re: multilingual list of stopwords
United Kingdom
2007-10-18 03:27:11
	Hi Maria,

this is a "me too". ;)
At the moment I'm merging the various language stopword files I need
into one and using that. But the main problem with this approach is
collisions: words which are stopwords in one language but not in
another.

	Cheers,
	Joe


Maria Mosolova wrote:
> I am looking for a multilingual list of stopwords to use with
> Solr/Lucene and would greatly appreciate advice on where I could
> find it.


Re: multilingual list of stopwords
2007-10-18 05:30:47
Hi,

I haven't heard of a multilingual stop word list before. What would be
the purpose of it? It seems too odd to me.
Stop words are used to cut down the size of the index.

One way you can go about this is to create your own list: index your
documents (without removing stop words), look at the most frequent
words, and build the list by picking some of them. This could work if
you want to index a static set of documents (so you know what your
content is about and can leave some words out without losing any
important information).
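
A minimal sketch of that frequency-based approach, in Python rather than against a real Lucene index. The tokenizer, the function name `candidate_stopwords`, and the sample documents are all assumptions made up for illustration; in practice you would read term frequencies from your own index.

```python
import re
from collections import Counter

def candidate_stopwords(documents, top_n=10):
    """Return the top_n most frequent lowercase tokens across all documents."""
    counts = Counter()
    for text in documents:
        # \w+ is a crude stand-in for a real analyzer/tokenizer.
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical sample corpus, only to show the mechanics.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat met on the road",
]

print(candidate_stopwords(docs, top_n=3))
```

Note that a frequent content word ("cat" in this toy corpus) lands among the candidates, which is why the list still has to be picked by hand rather than taken as-is.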

But I think the preferred way is to identify the language first and
then use the stop list specific to that language.
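
A rough sketch of "identify the language first, then apply its stop list". The language guess here is deliberately naive (count how many tokens hit each stop list) and the tiny lists are illustrative assumptions, not real stopword files; a proper language identifier would replace `guess_language`.

```python
import re

# Tiny illustrative stop lists (assumptions, far from complete).
STOPWORDS = {
    "en": {"the", "is", "by", "and", "of", "a"},
    "sv": {"och", "att", "det", "till", "en", "som"},
}

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def guess_language(tokens):
    """Pick the language whose stop list matches the most tokens.

    On a tie, the first language listed in STOPWORDS wins.
    """
    return max(STOPWORDS, key=lambda lang: sum(t in STOPWORDS[lang] for t in tokens))

def remove_stopwords(text):
    """Return (guessed language, tokens with that language's stopwords removed)."""
    tokens = tokenize(text)
    lang = guess_language(tokens)
    return lang, [t for t in tokens if t not in STOPWORDS[lang]]

print(remove_stopwords("The village is by the river"))
print(remove_stopwords("vi kom till en by och det var is"))
```

Only the stop list of the guessed language is applied, so a word that is a stopword in the other language is left alone.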

If you can't use language identification, then you can try more
creative approaches, such as employing some kind of document
classification algorithm and creating a stop list for each class.
Then, for every new document, you first determine which class it
belongs to and apply that class's stop list.
I am just sucking the wind here...

Regards,
Lukas

On 10/18/07, Joseph Doehr <dandracomyahoo.de> wrote:
>
>         Hi Maria,
>
> this is a "me too". ;)
> At the moment I'll take the way to merge the various language stopword
> files I need to one and use it. But the main problem in this case is,
> having collisions with words which are stopwords in one language and in
> the other not.
>
>         Cheers,
>         Joe
>
>
> Maria Mosolova wrote:
> > I am looking for a multilingual list of stopwords to use with
> > Solr/Lucene and would greatly appreciate an advice on where I could
> > find it.


-- 
http://blog.lukas-vlcek.com/

Re: multilingual list of stopwords
Poland
2007-10-18 06:16:56
Lukas Vlcek wrote:
> Hi,
>
> I haven't heard of multilingual stop words list before. What should be the
> purpose of it? This seems too odd to me

That's because a multilingual stopword list doesn't make sense ;)

One example that I'm familiar with: the words "is" and "by" in English
and in Swedish. Both words are stopwords in English, but they are
content words in Swedish (ice and village, respectively). Similarly,
"till" in Swedish is a stopword (to, towards), but it's a content word
in English.

So, as Lukas correctly suggested, you should first perform language
identification, and then apply the correct stopword list.
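
To make the collision concrete, here is a small Python sketch comparing a merged list with per-language lists. Only "is", "by" and "till" come from the example above; the other list entries and the two sample phrases are assumptions added for illustration.

```python
# Illustrative stop lists; "is", "by" and "till" are the colliding words.
EN_STOP = {"is", "by", "the", "a"}
SV_STOP = {"till", "och", "en"}
MERGED = EN_STOP | SV_STOP  # the "one big multilingual list" approach

def strip_stopwords(tokens, stoplist):
    """Drop every token that appears in the given stop list."""
    return [t for t in tokens if t not in stoplist]

swedish = "en by till salu".split()             # "a village for sale"
english = "wait till the shop is open".split()

# Merged list: the Swedish content word "by" (village) is lost,
# and so is the English content word "till".
print(strip_stopwords(swedish, MERGED))
print(strip_stopwords(english, MERGED))

# Per-language lists keep them.
print(strip_stopwords(swedish, SV_STOP))
print(strip_stopwords(english, EN_STOP))
```

With the merged list the Swedish phrase is reduced to "salu" alone, so a search for "by" would no longer find that document.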


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||/|  Information Retrieval, Semantic Web
___|||__||  |  ||  |  Embedded Unix, System Integration
http://www.sigram.com 
Contact: info at sigram dot com


Re: multilingual list of stopwords
United States
2007-10-18 07:52:14
Are you sure they don't just mean they want separate stopword lists
for various different indexes in different languages? Otherwise, I
agree, it doesn't make much sense for a single mixed-language index
(unless you had an intelligent filter that could select based on
language).

Maria, perhaps you have specific languages you are looking for? I
would just Google for "<Language> stopword list" and see what comes
up. There are a lot of multilingual resources out there.

-Grant

On Oct 18, 2007, at 7:16 AM, Andrzej Bialecki wrote:

> Lukas Vlcek wrote:
>> Hi,
>> I haven't heard of multilingual stop words list before. What
>> should be the purpose of it? This seems too odd to me
>
> That's because multilingual stopword list doesn't make sense ;)
>
> One example that I'm familiar with: words "is" and "by" in English
> and in Swedish. Both words are stopwords in English, but they are
> content words in Swedish (ice and village, respectively).
> Similarly, "till" in Swedish is a stopword (to, towards), but it's
> a content word in English.
>
> So, as Lukas correctly suggested, you should first perform language
> identification, and then apply the correct stopword list.

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


