On Sep 6, 2006, at 1:57 PM, via RT wrote:
> The default regex in KinoSearch::Analysis::Tokenizer breaks unicode.

Thank you for the report. Thank you especially for the test case,
which I will incorporate into KinoSearch's test suite.

The problem exposed by your test appears to be due to the loss of the
scalar's UTF8 flag as the text is absorbed into a
KinoSearch::Analysis::TokenBatch object, then recreated later. By
adding Encode::_utf8_on($_) at the right spot in Tokenizer::analyze,
we get the desired behavior in your test with the stock English
PolyAnalyzer. Unfortunately, the TokenBatch bug is not the only place
where Unicode support does not work properly in KinoSearch 0.12/0.13.
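
To illustrate the kind of failure described above, here is a minimal
sketch in plain Perl. It is not KinoSearch code: the tokens() helper and
the \w+ pattern are hypothetical stand-ins for a tokenizer, and the
sample text is made up. It assumes a stock perl with no locale or
unicode_strings features enabled. The same UTF-8 octets tokenize
differently depending on whether the scalar carries the UTF8 flag, and
Encode::_utf8_on restores the expected result.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 octets for "café naïve"; a plain byte string, UTF8 flag off.
my $octets = "caf\xC3\xA9 na\xC3\xAFve";

# Decoding yields a character string with the UTF8 flag on.
my $chars = Encode::decode( 'UTF-8', $octets );

# Hypothetical stand-in for a regex-based tokenizer.
sub tokens {
    my ($string) = @_;
    return [ $string =~ /(\w+)/g ];
}

my $from_chars  = tokens($chars);     # whole words: 2 tokens
my $from_octets = tokens($octets);    # high bytes split words: 3 tokens

# Simulate text that lost its flag in transit, then turn the flag back on.
# Only safe when the octets are known to be valid UTF-8.
my $restored = $octets;
Encode::_utf8_on($restored);
my $from_restored = tokens($restored);

printf "flagged: %d tokens, unflagged: %d tokens, restored: %d tokens\n",
    scalar @$from_chars, scalar @$from_octets, scalar @$from_restored;
```

Encode::_utf8_on only flips the flag and performs no validation, so in
code that cannot vouch for its input, Encode::decode is the safer choice.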

All these issues were addressed a few weeks back, but there has not
yet been a release incorporating the changes. The fix -- KS now
converts everything to Unicode for internal processing -- is not
backwards compatible, and so I'm trying to put together a single 0.20
release which aggregates multiple backwards-incompatible changes.

I would appreciate it if you would try a recent version from
KinoSearch's Subversion repository and see if it works properly for
you. As of this email, the current repository revision is 1216, which
I believe will work. However, there has been quite a bit of churn
lately, so if 1216 gives you trouble, you may wish to try revision
1030 instead.

    svn co -r 1216 http://www.rectangular.com/svn/kinosearch/trunk kinosearch

Best,
--
Marvin Humphrey
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch