Are you sure they don't just mean they want separate stopword lists
for various different indexes in different languages? Otherwise, I
agree, it doesn't make much sense for a single mixed language index
(unless you had an intelligent filter that could select based on
language.)
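To make the "select based on language" idea concrete, here is a rough
sketch in plain Python (not Lucene API). The STOPWORDS table and the
remove_stopwords function are made-up names for illustration, the word
lists are tiny samples, and the language code is assumed to come from a
separate language-identification step that is not shown.

```python
# Hypothetical per-language stopword sets; tiny samples for illustration only.
STOPWORDS = {
    "en": {"is", "by", "the", "a", "of"},
    "sv": {"till", "och", "att", "en"},
}


def remove_stopwords(tokens, language):
    """Drop stopwords using the list for the given language code.

    `language` is assumed to be produced by an earlier identification
    step. Tokens in an unknown language pass through unfiltered.
    """
    stop = STOPWORDS.get(language, set())
    return [t for t in tokens if t.lower() not in stop]


# "is" and "by" are dropped as English stopwords...
english = remove_stopwords(["the", "village", "is", "by", "the", "lake"], "en")
# ...but kept as Swedish content words (ice, village).
swedish = remove_stopwords(["is", "och", "by"], "sv")
```

With one merged list, "is" and "by" would be stripped from Swedish text
as well, which is the problem with a single multilingual list.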
Maria, perhaps you have specific languages you are looking for? I
would just Google for <Language> stopword list and see what comes
up. There are a lot of multilingual resources out there.
-Grant
On Oct 18, 2007, at 7:16 AM, Andrzej Bialecki wrote:
> Lukas Vlcek wrote:
>> Hi,
>> I haven't heard of multilingual stop words list before. What
>> should be the purpose of it? This seems too odd to me
>
> That's because a multilingual stopword list doesn't make sense ;)
>
> One example that I'm familiar with: words "is" and "by" in English
> and in Swedish. Both words are stopwords in English, but they are
> content words in Swedish (ice and village, respectively).
> Similarly, "till" in Swedish is a stopword (to, towards), but it's
> a content word in English.
>
> So, as Lukas correctly suggested, you should first perform language
> identification, and then apply the correct stopword list.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||/|   Information Retrieval, Semantic Web
> ___|||__|| | || |      Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http://www.apachecon.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ