List Info

Thread: Is there bug in Range searches?




Is there bug in Range searches?
country flaguser name
Bulgaria
2007-10-21 11:50:50
Hi Guys,
There is something in the Lucene that disturbs me. My
question is about 
sorting. In the queries there are used collator objects that
sort the 
results (in the class FieldSortedHitQueue). But in the
indexing process 
they are not used. As I now all the terms are ordered during
the 
indexing process. Based on this ordering are performed Range
searches. 
Currently I think the sorting during the indexing process is
made in the 
classes org.apache.lucene.index.Term and TermBuffer. But
there is used 
code like this:

text.compareTo(other.text);

or character comparison in the char arrays in the class
TermBuffer.

but not collator.compare(text, othre.text);

    So my questions are:

   1. Are Range queries work correctly with all languages
for which
      there are analyzers? (for example CJK and Thai);
   2. If the answer is no can I fix this problem by adding
collators in
      the above classes?

Best Regards,

Ivan

 

 

 


------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org


Re: Is there bug in Range searches?
user name
2007-10-22 00:25:19
:   1. Are Range queries work correctly with all languages
for which
:      there are analyzers? (for example CJK and Thai);

Terms when indexed are allways ordered lexigraphically
(using 
Term.compareTo which uses String.compareTo) ... regardless
of what field 
or language they are in, so "Range Queries" must
do their comparisons 
lexigraphically as well.

because all Terms are indexed in one continuous TermEnum, it
would be 
fairly imposible to definite different Collators per field
at index time.

it's pretty rare that anyone ever talks about Range Queries
on words, so 
this doesn't typiclly pose a problem ... i've never seen
anyone comment 
that they can't do sane Range queries on their text because
the language 
it's in doesn't collate in the same order as the default
compareTo.




-Hoss


------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org


Re: Is there bug in Range searches?
country flaguser name
Bulgaria
2007-10-22 10:11:29
10x Hoss for the answer. It is good news that this topic is
very rare 
and clients do not complain about this. I hope our clients
will also not 
complain 

Looking strictly at this I think this leads to a non correct
behavior on 
indexing applications, but as there are no unsatisfied
clients may be no 
more comments about this are needed.
I will just explain what I mean when speaking about
incorrect behaviour. 
Imagine index contains German files content. There are the
following 
words indexed: "Alle", "beschlossen",
"foto", "für", "Ältere".
This is the order of the words in the index, because this is
the order 
imposed by String.compareTo method. But when using German
collator it 
always puts the word "Ältere" after
"Alle" and before "beschlossen", no 
matter what strength level is used. This means that when
sorting we will 
have the words in the following order: "Alle",
"Ältere", "beschlossen", 
"foto", "für". And now imagine user
makes range query like this - 
content:(Alle TO foto) and expects the items containing
"Ältere" and 
"beschlossen" to be returned. But as in the index
the word "Ältere" is 
at the last place it's entry will not be returned, but only
those of 
"beschlossen" will persist in the results.
So in German the problem comes only with 3 umlaut letters
but as we will 
have clients using CJK documents I do not now how big is the
difference 
between String.compareTo and Collator.compare methods for
those languages.
I hope we will not have problems with those clients, but if
any I will 
change a bit source of Term and TermBuffer classes. By the
way I have 
experience with changes in those clases where I made values
of numeric 
fields to be ordered in numeric order but not in
lexicographical one.

Thanks ones again.
Ivan

Chris Hostetter wrote:
> :   1. Are Range queries work correctly with all
languages for which
> :      there are analyzers? (for example CJK and
Thai);
>
> Terms when indexed are allways ordered lexigraphically
(using 
> Term.compareTo which uses String.compareTo) ...
regardless of what field 
> or language they are in, so "Range Queries"
must do their comparisons 
> lexigraphically as well.
>
> because all Terms are indexed in one continuous
TermEnum, it would be 
> fairly imposible to definite different Collators per
field at index time.
>
> it's pretty rare that anyone ever talks about Range
Queries on words, so 
> this doesn't typiclly pose a problem ... i've never
seen anyone comment 
> that they can't do sane Range queries on their text
because the language 
> it's in doesn't collate in the same order as the
default compareTo.
>
>
>
>
> -Hoss
>
>
>
------------------------------------------------------------
---------
> To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
> For additional commands, e-mail: java-user-helplucene.apache.org
>
>
> __________ NOD32 2605 (20071022) Information
__________
>
> This message was checked by NOD32 antivirus system.
> http://www.eset.com
>
>
>
>   


------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org


[1-3]

about | contact  Other archives ( Real Estate discussion Medical topics )