Thread: Unicode problem




Unicode problem
user name
2008-03-03 10:43:29
There seems to be a problem with KinoSearch’s Unicode support. Greek
words can be listed in the index, but they always have a doc_freq of
0. The attached script demonstrates this problem. This is the output
it gives me:

Greek occurs in 1 document.
Hmm occurs in 1 document.
as occurs in 1 document.
in occurs in 1 document.
interesting occurs in 1 document.
or occurs in 1 document.
say occurs in 1 document.
they occurs in 1 document.
ἐνδιαφέρον occurs in 0 documents.

It didn’t give me any wide char warnings, so I looked into it further
and found that ‘ἐνδιαφέρον’ came out encoded as UTF-8
("\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"),
so maybe that’s part of the problem.




_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch


  
Re: Unicode problem
user name
2008-03-03 17:10:04
On Mar 3, 2008, at 8:43 AM, Father Chrysostomos wrote:

> I looked into it further and found that ‘ἐνδιαφέρον’
> came out encoded as UTF-8
> ("\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"),

If we isolate the original and use Devel::Peek to inspect it...

   use utf8;    # source is UTF-8, so the literal is decoded to characters
   use Devel::Peek;
   my $greek = 'ἐνδιαφέρον';
   Dump($greek);

... this is what we see:

SV = PV(0x91e374) at 0x8972dc
   REFCNT = 1
   FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
   PV = 0x11742e0 "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275" [UTF8 "\x{1f10}\x{3bd}\x{3b4}\x{3b9}\x{3b1}\x{3c6}\x{1f73}\x{3c1}\x{3bf}\x{3bd}"]
   CUR = 22
   LEN = 24

Here's what's coming out of $lexicon->get_term:

SV = PV(0x91d0dc) at 0x912f80
   REFCNT = 1
   FLAGS = (PADBUSY,PADMY,POK,pPOK)
   PV = 0x117fce0 "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"
   CUR = 22
   LEN = 24

The strings have the same byte sequence, but the second one is missing
the UTF8 flag, so Perl is interpreting it as Latin1.
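
The difference can be reproduced in pure Perl without KinoSearch. This
is only a sketch: $chars and $bytes are illustrative names, and
Encode's encode() stands in for whatever produced the unflagged
scalar. It builds two scalars holding the same 22 bytes, one with the
UTF8 flag and one without, and shows that Perl treats them as
different strings.

```perl
use strict;
use warnings;
use utf8;                  # the literal below is decoded to characters
use Encode qw(encode);

my $chars = 'ἐνδιαφέρον';              # character string, UTF8 flag on
my $bytes = encode('UTF-8', $chars);   # same bytes in memory, UTF8 flag off

# With the flag on, Perl counts 10 characters; without it, 22 Latin-1 "characters".
printf "flag %s, length %d\n", (utf8::is_utf8($chars) ? 'on' : 'off'), length $chars;
printf "flag %s, length %d\n", (utf8::is_utf8($bytes) ? 'on' : 'off'), length $bytes;

# Comparison is done on characters, so the two scalars are not equal.
print "equal: ", ($chars eq $bytes ? 'yes' : 'no'), "\n";
```

The lengths 10 and 22 correspond to the character count of the word
and to the CUR value in the dumps above.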

When we submit that scalar to $reader->doc_freq, the XS binding
extracts the string using SvPVutf8, which causes the supposedly Latin1
string to be, ahem, "upgraded" to UTF8.  The resulting garbage isn't
in the index.
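
A rough Perl-level illustration of that double encoding, assuming
utf8::encode as a stand-in for what requesting a UTF-8 buffer does to
an unflagged scalar (the names $bytes and $mangled are made up for
this sketch): every byte is taken as a Latin-1 character and encoded
again, so the 22 bytes grow to 44 and no longer match the indexed
term.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $bytes = encode('UTF-8', 'ἐνδιαφέρον');   # 22 bytes, UTF8 flag off

# Encode the unflagged scalar a second time: each high byte becomes two bytes.
my $mangled = $bytes;
utf8::encode($mangled);

printf "%d bytes before, %d bytes after\n", length $bytes, length $mangled;

# First four bytes in octal: \341 has become \303\241, \274 has become \302\274.
my $head = join ' ', map { sprintf '%03o', ord } split //, substr($mangled, 0, 4);
print "$head\n";
```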

The problem was a missing SvUTF8_on in the XS binding for  
Lexicon_Get_Term.  Fixed by r3103.  Thanks for the report.
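
For anyone stuck on a build without the fix, a possible Perl-level
workaround is sketched below. It is not the r3103 change itself, which
is in the XS code; it assumes the returned term is valid UTF-8 with
the flag off, and uses encode() to fake such a term rather than
calling KinoSearch. utf8::decode validates the bytes and turns the
flag on in place, which has the same effect on the scalar as
SvUTF8_on when the bytes are well-formed.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $chars = 'ἐνδιαφέρον';
my $term  = encode('UTF-8', $chars);   # stand-in for an unflagged term

# Check that the bytes are well-formed UTF-8 and flag the scalar in place.
utf8::decode($term) or die "term is not valid UTF-8";

print "flag: ",       (utf8::is_utf8($term) ? 'on' : 'off'),      "\n";
print "round trip: ", ($term eq $chars      ? 'ok' : 'mismatch'), "\n";
```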

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Unicode problem
user name
2008-03-04 16:43:59
On Mar 3, 2008, at 3:10 PM, Marvin Humphrey wrote:

> The problem was a missing SvUTF8_on in the XS binding for
> Lexicon_Get_Term.  Fixed by r3103.  Thanks for the report.

Here’s a test for it.






  
Re: Unicode problem
user name
2008-03-11 13:57:53
I sent this a week ago. Did you get it?

On Mar 4, 2008, at 2:43 PM, Father Chrysostomos wrote:

>
> On Mar 3, 2008, at 3:10 PM, Marvin Humphrey wrote:
>
>> The problem was a missing SvUTF8_on in the XS binding for
>> Lexicon_Get_Term.  Fixed by r3103.  Thanks for the report.
>
> Here’s a test for it.



  