
Thread: RE: multilingual list of stopwords




RE: multilingual list of stopwords
2007-10-18 10:14:36
There's code in Nutch to identify the language of a given text:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/lang/LanguageIdentifier.html

Peter 

-----Original Message-----
From: Maria Mosolova [mailto:mmosolovagmail.com] 
Sent: Thursday, October 18, 2007 8:48 AM
To: solr-userlucene.apache.org
Subject: Re: multilingual list of stopwords

Thanks a lot to everyone who responded. Yes, I agree that eventually we
need to use separate stopword lists for different languages.
Unfortunately, the data we are trying to index at the moment does not
contain any direct country/language information, and we need to create
the first version of the index quickly. It does not look like analyzing
documents to determine their language is something that could be
accomplished in a very limited timeframe. Or am I wrong here, and are
there existing analyzers one could use?
Maria

On 10/18/07, Walter Underwood <wunderwoodnetflix.com> wrote:
> Also "die" in German and English. --wunder
>
> On 10/18/07 4:16 AM, "Andrzej Bialecki" <abgetopt.org> wrote:
>
> > One example that I'm familiar with: words "is" and "by" in English
> > and in Swedish. Both words are stopwords in English, but they are
> > content words in Swedish (ice and village, respectively). Similarly,
> > "till" in Swedish is a stopword (to, towards), but it's a content
> > word in English.


Re: multilingual list of stopwords
2007-10-18 10:18:47
Thanks a lot Peter!
Maria

On 10/18/07, Binkley, Peter <PBinkleymail.library.ualberta.ca> wrote:
> There's code in Nutch to identify the language of a given text:
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/lang/LanguageIdentifier.html
>
> Peter
>
> -----Original Message-----
> From: Maria Mosolova [mailto:mmosolovagmail.com]
> Sent: Thursday, October 18, 2007 8:48 AM
> To: solr-userlucene.apache.org
> Subject: Re: multilingual list of stopwords
>
> Thanks a lot to everyone who responded. Yes, I agree that eventually
> we need to use separate stopword lists for different languages.
> Unfortunately, the data we are trying to index at the moment does not
> contain any direct country/language information, and we need to
> create the first version of the index quickly. It does not look like
> analyzing documents to determine their language is something that
> could be accomplished in a very limited timeframe. Or am I wrong
> here, and are there existing analyzers one could use?
> Maria
>
> On 10/18/07, Walter Underwood <wunderwoodnetflix.com> wrote:
> > Also "die" in German and English. --wunder
> >
> > On 10/18/07 4:16 AM, "Andrzej Bialecki" <abgetopt.org> wrote:
> >
> > > One example that I'm familiar with: words "is" and "by" in English
> > > and in Swedish. Both words are stopwords in English, but they are
> > > content words in Swedish (ice and village, respectively). Similarly,
> > > "till" in Swedish is a stopword (to, towards), but it's a content
> > > word in English.
>
>
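
The clash between stopword lists discussed in this thread can be sketched in a few lines of Python. This is only an illustration, not Solr or Nutch code: the `STOPWORDS` table and the `remove_stopwords` helper are hypothetical names, and the lists contain only the words mentioned in the thread (real lists would be much longer). It assumes the language of each document is already known, for example from a language identifier.

```python
# Hypothetical per-language stopword sets, seeded only with the words
# mentioned in the thread; real lists would be far longer.
STOPWORDS = {
    "en": {"is", "by"},
    "sv": {"till"},
    "de": {"die"},
}

# One merged multilingual list, as a quick first index might use.
MERGED = set().union(*STOPWORDS.values())


def remove_stopwords(tokens, lang):
    """Drop only the tokens that are stopwords in the given language."""
    stop = STOPWORDS.get(lang, set())  # unknown language: remove nothing
    return [t for t in tokens if t.lower() not in stop]


def remove_merged(tokens):
    """Drop tokens that are stopwords in any of the languages."""
    return [t for t in tokens if t.lower() not in MERGED]


swedish_tokens = ["is", "till", "by"]  # "ice", "to", "village"

# Per-language list keeps the Swedish content words "is" and "by".
print(remove_stopwords(swedish_tokens, "sv"))

# Merged list also removes them, so nothing is left to index.
print(remove_merged(swedish_tokens))
```

With the per-language table, the Swedish tokens "is" and "by" survive; with the merged list, all three tokens are dropped, which is the loss of content words the thread warns about.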
