
Thread: Is there bug in CJKAnalyzer?




Is there bug in CJKAnalyzer?
Ivan Vasilev (Bulgaria)
2007-10-22 10:39:05
Hi Guys,

I have run some tests with the CJKAnalyzer, and the results seem very
strange to me. First I have to say that I do not understand any of the
CJK languages.

What I do is the following: I write some text in English and translate
it using an online tool, which gives me the translated script per word
or per group of words. I put the translated text in separate files and
index them, using the proper encoding for the readers.

What is strange is that when I search for just one ideograph (no matter
whether it is a separate word in the text or part of a word), Lucene
almost never finds a result (maybe fewer than 5% of such searches find
results, for words like that=那, commas and so on).

I also copied and pasted text from the Chinese Academy of Sciences web
site, to rule out the case that the translation tool does not work
correctly. The result is the same.

But when I search for two or more consecutive ideographs, everything is
OK: if they are present in the text, they are found.

So my question is: is this normal behavior for CJKAnalyzer, not to find
results when only one ideograph is searched for, or is there some bug in
that Analyzer?

I would also like to say that I reindexed with a very simple class (not
with our search engine) to rule out any possible mistakes. The results
are the same.

I will give the example of the text that I use:

English:

The quick brown fox jumped over the lazy dog.

Chinese:

灵布朗狐逾懒狗。

English word by word:

|NA The |1 quick |2 brown |3 fox |4 jumped over |NA the |5 lazy |6 dog |7.

Corresponding Chinese words:

|1 灵 |2 布朗 |3 狐 |4 逾 |5 懒 |6 狗 |7。

NOTE: My files contain only the Chinese text.

Best Regards,
Ivan


------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org


Re: Is there bug in CJKAnalyzer?
Samir Abdou
2007-10-22 11:29:13
Hi,

For a Chinese token like ABCD (where A, B, C and D are Chinese signs),
CJKAnalyzer will generate the following overlapping bigrams:
AB  BC  CD.
Thus a query containing a single Chinese sign will not retrieve any
documents.  To overcome this, you have to index Chinese characters as
single tokens (this will increase recall, but decrease precision).
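
The bigram behaviour described above can be sketched in plain Java. This
is only an illustration, not Lucene code: BigramSketch, bigrams and
matches are hypothetical names, each Java char is treated as one sign,
and the sketch assumes that a query matches only when every one of its
terms is present in the index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this is not how Lucene is implemented.
final class BigramSketch {

    // Overlapping two-character terms for a run of CJK signs:
    // "ABCD" -> [AB, BC, CD]. Runs shorter than two chars yield nothing here.
    static List<String> bigrams(String run) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + 1 < run.length(); i++) {
            terms.add(run.substring(i, i + 2));
        }
        return terms;
    }

    // A query matches when every query term is among the indexed terms.
    // A one-character query is kept as a single one-character term.
    static boolean matches(String indexedRun, String query) {
        List<String> indexed = bigrams(indexedRun);
        List<String> queryTerms = query.length() < 2
                ? Collections.singletonList(query)
                : bigrams(query);
        return indexed.containsAll(queryTerms);
    }
}
```

With this sketch, matches("ABCD", "B") is false, while
matches("ABCD", "BC") and matches("ABCD", "ABC") are true.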

Hope this will help,
Samir



Re: Is there bug in CJKAnalyzer?
Ivan Vasilev (Bulgaria)
2007-10-23 07:08:43
Thanks Samir,

Your info was really helpful for us. I looked at the index with Luke,
and there the Chinese signs were split into pairs as you said: AB BC CD
etc. Also, when querying for ABC, the query is split into AB BC.
But how should I understand the meaning of this:
"To overcome this, you have to index chinese characters as single
tokens (this will increase recall, but decrease precision)."

I understand it this way: to get more results I have to use, instead of
the Chinese analyzer, another analyzer that tokenizes the text character
by character.

Do you know if range searches work correctly with CJK texts when
CJKAnalyzer was used during indexing? My opinion is that, since the
terms are sorted using the String.compareTo method and not
Collator.compare, range searches will be as close to correct as the
ordering of the second method is to the ordering of the first.
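
The difference between the two orderings can be seen with a minimal
Java sketch. TermOrderSketch and its method names are hypothetical; only
String.compareTo and java.text.Collator are real JDK APIs. Latin letters
are used because the difference is easy to check by eye; for CJK text
the collator's order depends on the rules of the chosen locale.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper names; this only contrasts two JDK orderings.
final class TermOrderSketch {

    // Code-unit order, which is what String.compareTo implements.
    static int byCodeUnit(String a, String b) {
        return a.compareTo(b);
    }

    // Locale-aware order, closer to what a reader of sorted text expects.
    static int byCollator(String a, String b, Locale locale) {
        return Collator.getInstance(locale).compare(a, b);
    }
}
```

Here byCodeUnit("Z", "a") is negative (Z sorts first), while
byCollator("Z", "a", Locale.ENGLISH) is positive (a sorts first), so a
range over code-unit order can include or exclude terms that a
collation-based range would not.
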
What is interesting is that when more than 2 Chinese characters are
passed in a range search, the Lucene query is not split into bigram
sub-queries (as in normal queries), and maybe some incorrect results
can come because of this.

Example: the text AAB is indexed. The index will contain: AA AB.
The user's range search is: content:[AG TO PQ]
Here B may be a word by itself, and the user will expect it in the
results (as B is after AG and before PQ), but as it is not a single
token but part of the pair AB, which is before AG, it will not be
returned.
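
The AAB example can be replayed with a sorted set standing in for the
term dictionary. This is a rough sketch under stated assumptions:
RangeSketch and inRange are hypothetical names, both bounds are treated
as inclusive, and terms are compared with String.compareTo.

```java
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a term dictionary; not Lucene code.
final class RangeSketch {

    // Terms t with lower <= t <= upper in String.compareTo order.
    static SortedSet<String> inRange(Collection<String> terms,
                                     String lower, String upper) {
        return new TreeSet<>(terms).subSet(lower, true, upper, true);
    }
}
```

For the bigram terms AA and AB, inRange(terms, "AG", "PQ") is empty,
because both terms sort before AG. For the single-character terms A and
B, the same call returns B only.
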
Here I can guess your answer: to prevent this, use a per-character
analyzer.

Thanks once again, Samir, for your explanation of how CJKAnalyzer works.
It was really reassuring to know that my CJKAnalyzer works as expected.

Best Regards,
Ivan





Re: Is there bug in CJKAnalyzer?
Steven Rowe (United States)
2007-10-23 10:38:47
Hi Ivan,

Ivan Vasilev wrote:
> But how to understand the meaning of this: To overcome this, you
> have to index chinese characters as single tokens (this will increase
> recall, but decrease precision).
>
> I understand it so: To increase the results I have to use instead of
> the Chinese another analyzer that makes tokenization of the text
> character by character.

StandardTokenizer[1] produces single-character tokens for Chinese
ideographs and Japanese kana.

However, AFAIK, you will no longer be able to perform range searches
like [AG TO PQ], because the terms "AG" and "PQ" will not be present in
the index.  [A TO P] should work, but I don't know how useful the
results would be, since this would match all words that contain the
ideographs [A TO P], not just those that start with them.  (Note that
this is also the case with the bigram tokens produced by CJKAnalyzer.)
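
A minimal sketch of single-character tokens, for comparison with the
bigrams. It is not StandardTokenizer itself: UnigramSketch and its
methods are hypothetical names, every code point becomes one term, and
no punctuation or non-CJK handling is attempted.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of per-character terms; not Lucene code.
final class UnigramSketch {

    // One term per Unicode code point: "ABCD" -> [A, B, C, D].
    static List<String> unigrams(String text) {
        return text.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    // A one-character query matches if that character occurs anywhere.
    static boolean containsTerm(String text, String term) {
        return unigrams(text).contains(term);
    }

    // True when any single-character term lies in [lower, upper], inclusive.
    static boolean anyTermInRange(String text, String lower, String upper) {
        return unigrams(text).stream()
                .anyMatch(t -> t.compareTo(lower) >= 0 && t.compareTo(upper) <= 0);
    }
}
```

In this sketch containsTerm("ABCD", "B") is true, and
anyTermInRange("ZB", "A", "P") is also true even though the word ZB does
not start with a character in the range.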

By the way, what is the use case for matching a range of words?  Doesn't
exposing this kind of functionality cause performance concerns?

Steve

[1] Lucene's StandardTokenizer API doc:
<http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp



Re: Is there bug in CJKAnalyzer?
Ivan Vasilev (Bulgaria)
2007-10-24 05:33:11
Hi Steven,

Thank you very much for your answer.
I tested with the StandardAnalyzer and it really tokenizes the text
ideograph by ideograph. Maybe, as Samir says in his mail, this is not
convenient for people who use CJK languages, because too many documents
will match. But the thing is that in this case (when using
StandardAnalyzer) the range searches work correctly. I tested it. The
logic is the same as in English range searches. Suppose that in English
you have the word "brown" and some tokenizer tokenizes it letter by
letter like this: 'b' 'r' 'o' 'w' 'n'. You can then still search with
bounds longer than one character. For example, consider the following
search: content:[aaa TO ccc]. The token 'b' will be found.
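
The "brown" example can be checked with a few lines of Java. This is a
sketch only: LetterRangeSketch and lettersInRange are hypothetical
names, and the comparison assumes inclusive bounds in String.compareTo
order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a range search over one-letter tokens.
final class LetterRangeSketch {

    // Returns the one-letter tokens of word that fall in [lower, upper].
    static List<String> lettersInRange(String word, String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (char c : word.toCharArray()) {
            String token = String.valueOf(c);
            if (token.compareTo(lower) >= 0 && token.compareTo(upper) <= 0) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

lettersInRange("brown", "aaa", "ccc") returns only the token b: it sorts
after aaa and before ccc, while r, o, w and n all sort after ccc.
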
Yes, for letter-based languages it does not make sense to tokenize
letter by letter, of course. But in CJK, as far as I know, in a great
number of cases single ideographs are separate words, or even groups of
words.
I tested range searches on the Chinese text indexed with
StandardAnalyzer, and everything in this context is OK.
The searches:
content:[u0E80 TO u0E80]
content:[u0E80u0E80 TO u0E80]
content:[u0E80u0E80u0E80 TO u0E80u0E80]
content:[u0E80u0E80u0E80 TO u0E80u0E80]

not only work but return the same result set as:
content:[u0E80 TO ]

Here u0E80 is the first ideograph of the CJK Unicode code points and  is
some ideograph present in some of the text files.
This of course also works with the CJKAnalyzer. But with
StandardAnalyzer, I think, the case that I described in my previous
mail will be avoided.

So I know range searches are a bit slower, but I am just fulfilling the
requirements of our customers. They will decide whether range searches
are convenient or not, and which Analyzer will help them better.

Thanks once again 

Best Regards,
Ivan


