
Thread: Default tokenizer regex breaks unicode




2006-09-07 20:03:13
On Sep 6, 2006, at 1:57 PM, via RT wrote:
> The default regex in KinoSearch::Analysis::Tokenizer breaks unicode.

Thank you for the report.  Thank you especially for the test case,
which I will incorporate into KinoSearch's test suite.

The problem exposed by your test appears to be due to the loss of the
scalar's UTF8 flag as the text is absorbed into a
KinoSearch::Analysis::TokenBatch object, then recreated later.  By
adding Encode::_utf8_on($_) at the right spot in Tokenizer::analyze,
we get the desired behavior in your test with the stock English
PolyAnalyzer.  Unfortunately, the TokenBatch bug is not the only
place where Unicode support does not work properly in KinoSearch
0.12/0.13.
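
To illustrate the flag loss described above, here is a small
standalone Perl sketch.  It is not KinoSearch code: the sample text
and variable names are made up for the example, and the round trip
through Encode::encode_utf8 only stands in for whatever happens to the
text inside TokenBatch.  It shows a \w+ match splitting a word once
the UTF8 flag is gone, and Encode::_utf8_on restoring the expected
tokens.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample text: "caf" + U+00E9, a space, then three
# Cyrillic letters.  The characters above U+00FF guarantee that Perl
# stores this scalar with the UTF8 flag on.
my $text = "caf\x{e9} \x{43c}\x{438}\x{440}";

# Stand-in for the round trip: keep only the raw UTF-8 bytes.
# The resulting scalar has the UTF8 flag off.
my $bytes = Encode::encode_utf8($text);

# On a byte string (no locale, no /u), \w matches ASCII word
# characters only, so the match stops at the first non-ASCII byte.
my @broken = $bytes =~ /(\w+)/g;

# Turn the flag back on for a copy of the same bytes.
my $restored = $bytes;
Encode::_utf8_on($restored);

# With the flag on, \w follows Unicode rules and whole words match.
my @fixed = $restored =~ /(\w+)/g;

binmode( STDOUT, ':encoding(UTF-8)' );
print "flag off: ", scalar(@broken), " token(s): @broken\n";
print "flag on:  ", scalar(@fixed),  " token(s): @fixed\n";
```

Encode::_utf8_on is an internal function that flips the flag without
validating the bytes, so it is only safe when the data is known to be
well-formed UTF-8.  Encode::decode('UTF-8', $bytes) is the checked
alternative.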

All these issues were addressed a few weeks back, but there has not
yet been a release incorporating the changes.  The fix -- KS now
converts everything to Unicode for internal processing -- is not
backwards compatible, and so I'm trying to put together a single 0.20
release which aggregates multiple backwards-incompatible changes.

I would appreciate it if you would try a recent version from
KinoSearch's subversion repository and see if it works properly for
you.  As of this email, the current repository revision is 1216,
which I believe will work.  However, there has been quite a bit of
churn lately, and you may wish to try revision 1030.

svn co -r 1216 http://www.rectangular.com/svn/kinosearch/trunk kinosearch

Best,

--
Marvin Humphrey



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
