On Feb 26, 2007, at 10:41 AM, Miles Crawford wrote:
>> DO NOT use KS -- especially version 0.20_01 and subsequent
>> releases -- with versions of Perl prior to 5.8.3. Those Unicode
>> bugs are vicious and very
>
> Do not use KS or do not use KS with unicode content? Does just the
> unicode fail to work or are there further-reaching consequences?

Do not use KS at all with a Perl older than 5.8.3. KS now converts
everything to Unicode at the front end, and all text is handled
internally as Unicode. For instance, in InvIndexer->add_doc, there's
this:

    for my $field_name ( keys %$doc ) {
        next unless $utf8_fields->{$field_name};
        utf8::upgrade( $doc->{$field_name} );
    }

If you supply Latin-1 text, it will get changed to Unicode by that
utf8::upgrade call. So, no matter what the source material, KS will
be vulnerable to Perl's Unicode bugs.
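
To make the effect of that call concrete, here is a minimal sketch using only core Perl. The sample string and the helper name upgraded_copy are mine, not anything from KS. It shows that utf8::upgrade changes how a Latin-1 string is stored internally without changing its characters:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of KS): return an upgraded copy of a string.
sub upgraded_copy {
    my ($string) = @_;         # copy, so the caller's string is untouched
    utf8::upgrade($string);    # switch the internal storage to UTF-8
    return $string;
}

my $latin1   = "caf\xE9";      # "cafe" with e-acute, as four Latin-1 bytes
my $upgraded = upgraded_copy($latin1);

printf "original flagged as UTF-8: %s\n", utf8::is_utf8($latin1)   ? "yes" : "no";
printf "upgraded flagged as UTF-8: %s\n", utf8::is_utf8($upgraded) ? "yes" : "no";
printf "same characters:           %s\n", $latin1 eq $upgraded     ? "yes" : "no";
printf "length after upgrade:      %d\n", length $upgraded;
```

The two strings compare equal and have the same length; only the internal flag differs. That internal difference is what exposes the text to the Unicode bugs in older Perls.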

You also get Unicode text back from KS, e.g. from
Hits->fetch_hit_hashref(). However, if the Unicode text contains no
characters outside of Latin-1, Perl can convert back and forth
transparently and you shouldn't notice anything different.
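
As a rough illustration of that transparency, the sketch below (core Perl only; the helper name fits_in_latin1 and the sample strings are made up for the example) checks whether a Unicode string can be converted back to a plain Latin-1 byte string. utf8::downgrade with a true second argument returns false instead of dying when the string contains a character outside Latin-1:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text has no characters outside Latin-1.
sub fits_in_latin1 {
    my ($text) = @_;                       # work on a copy
    return utf8::downgrade( $text, 1 ) ? 1 : 0;
}

my $western = "na\xEFve";
utf8::upgrade($western);                   # stand-in for upgraded text from a hit

# "privet" in Cyrillic, written as code points.
my $cyrillic = "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}";

print "western fits in Latin-1:  ", fits_in_latin1($western),  "\n";
print "cyrillic fits in Latin-1: ", fits_in_latin1($cyrillic), "\n";
print "compares equal to bytes:  ", ( $western eq "na\xEFve" ? 1 : 0 ), "\n";
```

The upgraded Latin-1 string still compares equal to the original byte string, which is why you shouldn't notice a difference. The Cyrillic string is the case where you do have to encode explicitly.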

The bottom line is that Latin-1 source material should work without
you having to think about it, but you have to be using Perl 5.8.3 or
above.
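
If you'd rather have your own indexing script refuse to run on an older Perl than trip over those bugs later, a guard along these lines would do it. This is only a sketch: the helper name perl_is_new_enough is hypothetical, and in practice a plain `use 5.008003;` at the top of the script achieves the same thing at compile time.

```perl
use strict;
use warnings;

# Hypothetical helper: compare a numeric Perl version (the format of $])
# against the 5.8.3 minimum.
sub perl_is_new_enough {
    my ($version) = @_;
    return $version >= 5.008003 ? 1 : 0;
}

die "Perl 5.8.3 or later is required, this is $]\n"
    unless perl_is_new_enough($]);
print "Perl $] is new enough\n";
```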

But now, unlike KS prior to 0.20_01, if you want to supply Unicode
text you've prepared yourself, things will work.

    # invindexer.plx
    use Encode qw( decode );

    my %doc = (
        content => decode( 'KOI8-R', $source_bytes ),
    );
    $invindexer->add_doc( %doc );

    # searcher.cgi
    use Encode qw( encode );

    while ( my $hit = $hits->fetch_hit_hashref ) {
        $_ = encode( 'KOI8-R', $_ ) for values %$hit;
        ...
    }
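
For anyone who wants to try the decode/encode steps without an invindex handy, here is a standalone sketch of the same round trip using only the Encode module. The sample bytes are my own test data (the Russian word "privet" in KOI8-R) and KS is left out entirely, so this only demonstrates the conversion on either side of it:

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Sample data for the sketch: "privet" in Cyrillic as raw KOI8-R bytes.
my $source_bytes = "\xF0\xD2\xC9\xD7\xC5\xD4";

# Indexing side: KOI8-R bytes in, Perl Unicode text out.
my $text = decode( 'KOI8-R', $source_bytes );

# Search side: Unicode text back to KOI8-R bytes for output.
my $output_bytes = encode( 'KOI8-R', $text );

printf "characters after decode: %d\n", length($text);
printf "first code point:        U+%04X\n", ord($text);
printf "round trip intact:       %s\n",
    $output_bytes eq "\xF0\xD2\xC9\xD7\xC5\xD4" ? "yes" : "no";
```

The decoded text is six characters starting with U+041F, and encoding it again gives back the original six bytes.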
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/