
Thread: Re: add CJKTokenizer to solr




2007-06-22 05:17:55
I'm jumping in the middle of the thread here.
CJK = Chinese, Japanese, Korean
German = etwas ganz anderes (something else entirely)
Why are you trying to use CJKAnalyzer+Tokenizer for German? 
Have you tried German Analyzer from Lucene contrib?

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  - 
Tag  -  Search  -  Share

----- Original Message ----
From: Xuesong Luo <xluosuccessfactors.com>
To: solr-userlucene.apache.org
Sent: Friday, June 22, 2007 8:54:37 AM
Subject: RE: add CJKTokenizer to solr

Thanks, Toru and Chris,
I tried both the CJKTokenizer and the CJKAnalyzer. Both return
some unexpected highlight results when I tested with
German. The field value I searched is "Ein Mann beißt
den Hund". The search term is beißt.

When using CJKAnalyzer, beißt is treated as 2 single
terms (bei and ß), and the highlight result is:
<str>Ein Mann <em>bei</em><em>ß</em>t den Hund</str>

When using CJKTokenizer, beißt is treated as 3 single terms,
and the result is:
<str>Ein Mann <em>bei</em><em>ß</em><em>t</em> den Hund</str>

When using the standard tokenizer, beißt is treated as a word,
and the result is:
<str>Ein Mann <em>beißt</em> den Hund</str>


I understand why the standard tokenizer treats beißt as a
word, but I don't know how CJKAnalyzer and CJKTokenizer work.
Could anyone explain a little bit?


Thanks
Xuesong
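
One way to picture the three results above is the simplified model
below. It is a sketch under stated assumptions, not Lucene's actual
code: it assumes the tokenizer keeps runs of ASCII letters and digits
as whole lowercased terms, treats every non-ASCII letter as if it were
a CJK character (overlapping two-character terms, or a single
character when it stands alone), and that the analyzer then drops stop
words, with the single letter "t" assumed to be on that list. The
names CjkSketch, tokenize, analyze and STOP_WORDS are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified model of CJK-style tokenization; NOT the Lucene implementation.
class CjkSketch {
    // Assumed stop-word subset, for illustration only.
    static final Set<String> STOP_WORDS = Set.of("a", "s", "t", "the");

    // Splits text into ASCII runs and non-ASCII runs.
    // Surrogate pairs are ignored to keep the sketch short.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                i++; // anything else acts as a delimiter
                continue;
            }
            boolean ascii = c < 128;
            int start = i;
            // Extend the run while the characters stay in the same class.
            while (i < text.length()
                    && Character.isLetterOrDigit(text.charAt(i))
                    && (text.charAt(i) < 128) == ascii) {
                i++;
            }
            String run = text.substring(start, i);
            if (ascii) {
                tokens.add(run.toLowerCase(Locale.ROOT)); // one term per ASCII run
            } else if (run.length() == 1) {
                tokens.add(run); // a lone non-ASCII letter stands by itself
            } else {
                // Overlapping two-character terms for longer non-ASCII runs.
                for (int j = 0; j + 1 < run.length(); j++) {
                    tokens.add(run.substring(j, j + 2));
                }
            }
        }
        return tokens;
    }

    // Tokenizer output with the assumed stop words removed.
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String term : tokenize(text)) {
            if (!STOP_WORDS.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("beißt"));  // [bei, ß, t]
        System.out.println(analyze("beißt"));   // [bei, ß]
        System.out.println(tokenize("日本語")); // [日本, 本語]
    }
}
```

Under these assumptions the tokenizer alone yields three terms for
beißt, and the analyzer yields two because "t" is filtered out, which
matches the highlighting reported above.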

-----Original Message-----
From: Toru Matsuzawa [mailto:matsuccs.co.jp] 
Sent: Monday, June 18, 2007 10:29 PM
To: solr-userlucene.apache.org
Subject: Re: add CJKTokenizer to solr

I'm sorry. The attachment did not go through,
so I am sending it again inline.

> > I got the error below after adding CJKTokenizer to schema.xml.  I
> > checked the constructor of CJKTokenizer, it requires a Reader parameter,
> > I guess that's why I get this error, I searched the email archive, it
> > seems working for other users. Does anyone know what is the problem?
>
>
> CJKTokenizerFactory that I am using is appended.
>
> 
--
package org.apache.solr.analysis.ja;

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * CJKTokenizer for Solr: a tokenizer factory that wraps Lucene's CJKTokenizer.
 *
 * @see org.apache.lucene.analysis.cjk.CJKTokenizer
 * @author matsu
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {

  /**
   * Creates a CJKTokenizer that reads from the given input.
   *
   * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
   */
  public TokenStream create(Reader input) {
    return new CJKTokenizer(input);
  }

}


-- 
Toru Matsuzawa






