Thanks a lot Peter!
Maria
On 10/18/07, Binkley, Peter <PBinkley mail.library.ualberta.ca> wrote:
> There's code in Nutch to identify the language of a given text:
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/lang/LanguageIdentifier.html .
>
> Peter
>
> -----Original Message-----
> From: Maria Mosolova [mailto:mmosolova gmail.com]
> Sent: Thursday, October 18, 2007 8:48 AM
> To: solr-user lucene.apache.org
> Subject: Re: multilingual list of stopwords
>
> Thanks a lot to everyone who responded. Yes, I agree that eventually we
> need to use separate stopword lists for different languages.
> Unfortunately, the data we are trying to index at the moment does not
> contain any direct country/language information, and we need to create
> the first version of the index quickly. It does not look like analyzing
> documents to determine their language is something which could be
> accomplished in a very limited timeframe. Or am I wrong here, and there
> are existing analyzers one could use?
> Maria
>
> On 10/18/07, Walter Underwood <wunderwood netflix.com> wrote:
> > Also "die" in German and English. --wunder
> >
> > On 10/18/07 4:16 AM, "Andrzej Bialecki" <ab getopt.org> wrote:
> >
> > > One example that I'm familiar with: words "is" and "by" in English
> > > and in Swedish. Both words are stopwords in English, but they are
> > > content words in Swedish (ice and village, respectively). Similarly,
> > > "till" in Swedish is a stopword (to, towards), but it's a content
> > > word in English.
> >
> >
>
>
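
The collisions mentioned in this thread ("is", "by", "till", "die") can be
illustrated with a small sketch. This is not Solr or Nutch code; the tiny
stopword sets, the language codes, and both function names are hypothetical
and exist only for illustration. It assumes the language of each document
is already known, for example from a language identifier.

```python
# Hypothetical, deliberately tiny stopword lists built only from the
# examples in this thread; real lists would be far longer.
STOPWORDS = {
    "en": {"is", "by"},
    "sv": {"till"},
    "de": {"die"},
}


def remove_stopwords(tokens, lang):
    """Drop only the tokens that are stopwords in the given language."""
    stop = STOPWORDS.get(lang, set())
    return [t for t in tokens if t.lower() not in stop]


def remove_stopwords_merged(tokens):
    """Drop tokens found in ANY language's list (one multilingual list)."""
    merged = set().union(*STOPWORDS.values())
    return [t for t in tokens if t.lower() not in merged]


tokens = ["is", "by", "till"]
print(remove_stopwords(tokens, "sv"))   # Swedish content words survive
print(remove_stopwords(tokens, "en"))   # English content word survives
print(remove_stopwords_merged(tokens))  # merged list removes all three
```

With a single merged list, Swedish "is" (ice) and "by" (village) and English
"till" and "die" are all dropped, which is the trade-off a quick first
version of the index would accept.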