
Thread: Malformed UTF8 characters




Malformed UTF8 characters
United States
2007-02-08 15:11:30
I notice that the yaz_read_UTF8_char function in siconv.c only checks
the leading byte of a multi-byte sequence. I have seen cases where an
incorrectly formed subsequent byte comes along. I suggest that the
following modification be made to each of the 'else if' statements in
this function. Two examples are shown:

(beginning about line 227)
    else if (inp[0] <= 0xdf && inbytesleft >= 2)  // 2-byte character
    {
        if ((inp[1] & 0xc0) != 0x80)  // check for malformed 2nd byte
        {
            *no_read = 0;
            *error = YAZ_ICONV_EILSEQ;
        } else {
            x = ((inp[0] & 0x1f) << 6) | (inp[1] & 0x3f);
            if (x >= 0x80)
                *no_read = 2;
            else  // value fits in 1 byte: overlong encoding
            {
                *no_read = 0;
                *error = YAZ_ICONV_EILSEQ;
            }
        }
    }

(beginning about line 244)
    else if (inp[0] <= 0xef && inbytesleft >= 3)  // 3-byte character
    {
        // check for malformed 2nd and 3rd bytes
        if (((inp[1] & 0xc0) != 0x80) || ((inp[2] & 0xc0) != 0x80))
        {
            *no_read = 0;
            *error = YAZ_ICONV_EILSEQ;
        } else {
            x = ((inp[0] & 0x0f) << 12) | ((inp[1] & 0x3f) << 6) |
                (inp[2] & 0x3f);
            if (x >= 0x800)
                *no_read = 3;
            else  // value fits in fewer bytes: overlong encoding
            {
                *no_read = 0;
                *error = YAZ_ICONV_EILSEQ;
            }
        }
    }

May I suggest that this simple check for malformed trailing bytes be
added to your code for each of the 5 cases where a trailing byte can
exist?
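
Since the same test repeats in every multi-byte branch, it could also be
factored into one small helper. The following is only a sketch: the name
utf8_trail_ok is hypothetical and does not exist in siconv.c, and it
assumes the caller has already confirmed that len bytes are available
(the inbytesleft check in the examples above).

```c
#include <stddef.h>

/* Hypothetical helper, not part of YAZ.
 * Returns 1 if every byte after the lead byte, i.e. inp[1] .. inp[len-1],
 * has the continuation form 10xxxxxx, and 0 otherwise.
 * Assumes the caller has verified that at least len bytes are readable. */
static int utf8_trail_ok(const unsigned char *inp, size_t len)
{
    size_t i;

    for (i = 1; i < len; i++)
    {
        /* The top two bits of a continuation byte must be 1 and 0. */
        if ((inp[i] & 0xc0) != 0x80)
            return 0;
    }
    return 1;
}
```

Each branch would then call it with its own sequence length, for example
utf8_trail_ok(inp, 3) in the 3-byte case, and set *no_read = 0 and
*error = YAZ_ICONV_EILSEQ when it returns 0.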

Just another observation: I think the Unicode Consortium has officially
stated that a UTF-8 encoding will never be longer than 4 bytes.
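
For reference, RFC 3629 limits UTF-8 to code points up to U+10FFFF, which
rules out the old 5-byte and 6-byte forms. A lead-byte classifier that
follows that limit might look like the sketch below. The function name
utf8_seq_len is made up for illustration and is not taken from siconv.c.

```c
/* Hypothetical helper, not part of YAZ.
 * Maps a lead byte to the length of the sequence it starts under the
 * RFC 3629 rules, or returns 0 if the byte cannot start a sequence. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead <= 0x7f)
        return 1;                 /* 0xxxxxxx: ASCII */
    if (lead >= 0xc2 && lead <= 0xdf)
        return 2;                 /* 110xxxxx; 0xc0 and 0xc1 are always overlong */
    if (lead >= 0xe0 && lead <= 0xef)
        return 3;                 /* 1110xxxx */
    if (lead >= 0xf0 && lead <= 0xf4)
        return 4;                 /* 11110xxx, capped at U+10FFFF */
    return 0;                     /* continuation byte, or 0xf5..0xff */
}
```

With this rule, lead bytes 0xf8 and above (the former 5-byte and 6-byte
forms) are rejected before any trailing bytes are examined.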

Gary.

_______________________________________________
Yazlist mailing list
Yazlistlists.indexdata.dk
http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist

  
Re: Malformed UTF8 characters
Denmark
2007-02-12 04:26:10
Gary Anderson wrote:
> I notice that the yaz_read_UTF8_char function in siconv.c is only
> checking the beginning byte of a multi-byte sequence. I have found
> times when an incorrectly formed subsequent byte comes along. I suggest
> that the following modification be made to each of your 'else if'
> statements in this function. Two examples are shown:

Gary, thanks for your proposed changes. I have added them to our YAZ
Bugzilla:

http://bugzilla.indexdata.dk/show_bug.cgi?id=887

I hope that the core development team will soon have a look at your
proposal and integrate it into YAZ.

Yours, Marc Cromme

-- 

Marc Cromme
M.Sc and Ph.D in Mathematical Modelling and Computation
Senior Developer, Project Manager

Index Data Aps
Købmagergade 43, 2
1150 Copenhagen K.
Denmark

tel: +45 3341 0100
fax: +45 3341 0101

http://www.indexdata.com


INDEX DATA Means Business
for Open Source and Open Standards





