I'm jumping in the middle of the thread here.
CJK = Chinese, Japanese, Korean
German = etwas ganz anderes (something else entirely)
Why are you trying to use CJKAnalyzer+Tokenizer for German?
Have you tried GermanAnalyzer from Lucene contrib?
Otis
----- Original Message ----
From: Xuesong Luo <xluo successfactors.com>
To: solr-user lucene.apache.org
Sent: Friday, June 22, 2007 8:54:37 AM
Subject: RE: add CJKTokenizer to solr
Thanks, Toru and Chris,
I tried both the CJKTokenizer and the CJKAnalyzer. Both return
some unexpected highlight results when I test with German.
The field value I searched is "Ein Mann beißt den Hund".
The search term is beißt.
When using CJKAnalyzer, beißt is treated as 2 single
terms (bei and ß), and the highlight result is:
<str>Ein Mann <em>bei</em><em>ß</em>t den Hund</str>
When using CJKTokenizer, beißt is treated as 3 single terms,
and the result is:
<str>Ein Mann <em>bei</em><em>ß</em><em>t</em> den Hund</str>
When using the standard tokenizer, beißt is treated as one word,
and the result is:
<str>Ein Mann <em>beißt</em> den Hund</str>
I understand why the standard tokenizer treats beißt as one
word, but I don't know how CJKAnalyzer and CJKTokenizer work.
Could anyone explain a little bit?
Thanks
Xuesong
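On the question above of how the CJK classes arrive at these terms: the highlight output is consistent with a tokenizer that groups only Basic Latin (ASCII) letters and digits into word tokens and handles every other letter, including ß, on the CJK path. The class below is a rough model of that rule only, not the Lucene source. The class name CjkSplitSketch and its methods are made up for illustration, and the sketch ignores the overlapping two-character tokens that CJKTokenizer produces for runs of CJK characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Rough model of the splitting seen in the highlight output; NOT the Lucene CJKTokenizer code. */
public class CjkSplitSketch {

    /** Assumption: only Basic Latin letters and digits are grouped into word tokens. */
    public static boolean isBasicLatinWordChar(char c) {
        return c < 0x80 && Character.isLetterOrDigit(c);
    }

    /** Splits text into lowercased ASCII runs and single non-ASCII letters. */
    public static List<String> split(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isBasicLatinWordChar(c)) {
                run.append(Character.toLowerCase(c));
                continue;
            }
            // Any other character ends the current ASCII run.
            if (run.length() > 0) {
                tokens.add(run.toString());
                run.setLength(0);
            }
            // A non-ASCII letter such as ß becomes a token of its own in this model.
            if (Character.isLetter(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [ein, mann, bei, ß, t, den, hund]
        System.out.println(split("Ein Mann beißt den Hund"));
    }
}
```

This reproduces the three terms reported for CJKTokenizer. CJKAnalyzer additionally applies a stop filter, and as far as I recall its default English stop list contains the single letter "t", which would explain why only bei and ß are highlighted in that case. Please check the stop list in your Lucene version before relying on that.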
-----Original Message-----
From: Toru Matsuzawa [mailto:matsu ccs.co.jp]
Sent: Monday, June 18, 2007 10:29 PM
To: solr-user lucene.apache.org
Subject: Re: add CJKTokenizer to solr
I'm sorry. The attachment did not go through,
so I am sending it again.
> > I got the error below after adding CJKTokenizer to schema.xml. I
> > checked the constructor of CJKTokenizer, it requires a Reader parameter,
> > I guess that's why I get this error, I searched the email archive, it
> > seems working for other users. Does anyone know what is the problem?
>
>
> CJKTokenizerFactory that I am using is appended.
>
--
package org.apache.solr.analysis.ja;

import java.io.Reader;

import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * CJKTokenizer for Solr
 * @see org.apache.lucene.analysis.cjk.CJKTokenizer
 * @author matsu
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {
  /**
   * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
   */
  public TokenStream create(Reader input) {
    return new CJKTokenizer(input);
  }
}
--
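For reference, a factory like the one above would typically be wired into schema.xml along these lines. This is only a sketch: the field type name text_cjk is made up, and it assumes the compiled class is on Solr's classpath. The point is that schema.xml names the factory rather than CJKTokenizer itself, since Solr cannot instantiate a tokenizer whose constructor requires a Reader.

```xml
<!-- Hypothetical field type; the name "text_cjk" is only an example. -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <!-- Refer to the factory, not to CJKTokenizer directly. -->
    <tokenizer class="org.apache.solr.analysis.ja.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```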
Toru Matsuzawa