Thread: Re: Fields ending in combining diacritics




Re: Fields ending in combining diacritics
Denmark
2007-03-09 01:07:21
Gary Anderson wrote:
> I am not sure how this will help.  In the application, the last 2 bytes
> of the data string are 0xEA and 0x1E - the diacritic and the record
> mark.  yaz_iconv seems to drop the diacritic because it doesn't have a
> trailing character, but it does process the record mark.  What I need is
> something that will tell me that this case has occurred.  It looks to me
> like yaz just drops the diacritic.

I don't see a way the iconv interface could tell you this. I'm still a
little confused, so forgive me for asking: what is the behavior you
want? (Keep the diacritic?)

/ Adam
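
If the goal is only to be told that the case has occurred, one option
that does not depend on the iconv interface is to scan the field before
converting it. The following is a minimal sketch, not part of YAZ:
find_dangling_diacritic is a hypothetical helper, and it assumes the
data is MARC-8 in the default Latin/ANSEL sets (no escape sequences),
where combining diacritics occupy 0xE0..0xFE and precede their base
character.

```c
#include <assert.h>
#include <stddef.h>

/* MARC (ISO 2709) structural delimiters. */
#define MARC_RECORD_TERM    0x1D
#define MARC_FIELD_TERM     0x1E
#define MARC_SUBFIELD_DELIM 0x1F

/* Assumption: ANSEL combining diacritics live in 0xE0..0xFE and are
 * stored BEFORE the base character they modify. */
static int is_marc8_combining(unsigned char c)
{
    return c >= 0xE0 && c <= 0xFE;
}

static int is_marc_delimiter(unsigned char c)
{
    return c == MARC_RECORD_TERM || c == MARC_FIELD_TERM ||
           c == MARC_SUBFIELD_DELIM;
}

/* Hypothetical helper: returns the offset of the first combining
 * diacritic that is followed by a delimiter or by the end of the
 * buffer instead of a base character, or -1 if there is none. */
long find_dangling_diacritic(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (!is_marc8_combining(buf[i]))
            continue;
        /* Skip a run of stacked diacritics to find what follows it. */
        size_t j = i;
        while (j < len && is_marc8_combining(buf[j]))
            j++;
        if (j == len || is_marc_delimiter(buf[j]))
            return (long)i;
        i = j; /* buf[j] is the base character; continue after it */
    }
    return -1;
}
```

A caller could run this on each field string before handing it to the
converter and log or reject the record when the result is not -1.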

>
> My checking indicates that on completion of conversion of the record
> mark, the yaz_iconv library is left in its 'initial state'.  The next
> string converts just fine.
> Gary
>
> Adam Dickmeiss wrote:
>
>> Gary Anderson wrote:
>>
>>> I am using the siconv interface.  I have a programmatic process that
>>> deals with very large files of records.
>>>
>>> Adam Dickmeiss wrote:
>>>
>>>> Gary Anderson wrote:
>>>>
>>>>> I recently ran some tests using records from the National Library
>>>>> of Canada.  Of the 600,000+ records in their name and subject
>>>>> authority file, six records had 670 tags where the subfield a data
>>>>> ended in a combining diacritic character with no following character.
>>>>>
>>>>> Submitting that data string
>>>>> (indicators+subfieldmark+subfieldcode+data+fieldmark) to siconvert
>>>>> resulted in an output string that did not contain the diacritic
>>>>> character.  It was dropped.  The field mark character was
>>>>> retained.  Can you suggest a means for notifying the caller when
>>>>> this condition occurs?  Byte counts don't really work because UTF8
>>>>> is one side or the other of the conversion transaction.
>>>>>
>>>>> The ending diacritic values were: 0xE2, 0xE5, 0xE8, 0xEA, and 0xF6.
>>>>
>>
>> I think what you need to do is to "flush" (reset to the "initial
>> state"). The flush would take place after a field or subfield ends.
>>
>> That's done by iconv and, hopefully, yaz_iconv by setting inbuf or
>> *inbuf to NULL, but outbuf to non-NULL, i.e.
>>
>> yaz_iconv(cd, 0, 0, &outbuf, &outbytesleft);
>>
>> From 'man 3 iconv':
>> "
>> A different case is when inbuf is NULL or *inbuf is NULL, but outbuf is
>> not NULL and *outbuf is not NULL. In this case, the iconv() function
>> attempts to set cd's conversion state to the initial state and store a
>> corresponding shift sequence at *outbuf. At most *outbytesleft bytes,
>> starting at *outbuf, will be written. If the output buffer has no more
>> room for this reset sequence, it sets errno to E2BIG and returns
>> (size_t)(-1). Otherwise it increments *outbuf and decrements
>> *outbytesleft by the number of bytes written.
>> "
>>
>> Use YAZ 2.1.48 or later for this to work.
>>
>> / Adam
>>
>>>>>
>>>> Did you use yaz-marcdump for the conversion?
>>>>
>>>> Or did you do something else (such as programming towards the
>>>> siconv interface)?
>>>>
>>>> / Adam
>>>>
>>>>> Thanks
>>>>> Gary
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Yazlist mailing list
>>>>> Yazlist@lists.indexdata.dk
>>>>> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist

Re: Fields ending in combining diacritics
Denmark
2007-03-09 02:40:50
Adam Dickmeiss wrote:
> Gary Anderson wrote:
>> I am not sure how this will help.  In the application, the last 2
>> bytes of the data string are 0xEA and 0x1E - the diacritic and the
>> record mark.  yaz_iconv seems to drop the diacritic because it doesn't
>> have a trailing character, but it does process the record mark.  What
>> I need is something that will tell me that this case has occurred.  It
>> looks to me like yaz just drops the diacritic.
> I don't see a way the iconv interface could tell you this. I'm still a
> little confused, so forgive me for asking: what is the behavior you
> want? (Keep the diacritic?)
>

In case you do *not* want to keep the diacritic, in other words you are
asking to be notified about an error, then maybe it's best to use
EINVAL, because the iconv man page says:

"EINVAL An incomplete multibyte sequence has been encountered in the
input."

Case 1: If you pass
    .. 0xEA
you get EINVAL because no characters follow 0xEA (as far as iconv is
concerned).

Case 2: If you pass
    .. 0xEA 0x1E
that would not return an error. In fact YAZ currently converts this to
UTF-8:
    0x1E 0xCC 0x8A
because 0x1E is just a "character".

Unfortunately for case 1, YAZ currently returns 'unknown error'. That's
no good. This has been fixed in the CVS version of YAZ.

/ Adam
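
The EINVAL behavior described for case 1 can be seen with the standard
POSIX iconv() as well. The sketch below is only an analogy: system
iconv generally has no MARC-8 converter, so it uses a truncated UTF-8
sequence as a stand-in for the trailing 0xEA. convert_and_report is a
hypothetical helper, and "UTF-8" / "ISO-8859-1" are assumed to be
encoding names that the local iconv accepts.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts `in` from encoding `from` to `to`.
 * Returns 0 on success, or the errno value reported by iconv_open()
 * or iconv(). *unconsumed receives the number of input bytes that
 * were left unconverted. */
int convert_and_report(const char *from, const char *to,
                       const char *in, size_t inlen,
                       char *out, size_t outlen, size_t *unconsumed)
{
    assert(in != NULL && out != NULL && unconsumed != NULL);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return errno;

    char *inbuf = (char *)in;   /* iconv() does not modify the input */
    size_t inleft = inlen;
    char *outbuf = out;
    size_t outleft = outlen;
    int err = 0;

    if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1)
        err = errno;            /* EINVAL: incomplete trailing sequence */

    /* The "flush": a NULL input resets cd to its initial state and
     * writes any pending shift sequence to the output buffer. */
    iconv(cd, NULL, NULL, &outbuf, &outleft);

    *unconsumed = inleft;
    iconv_close(cd);
    return err;
}
```

With input "abc" followed by a lone 0xC3, the call should report EINVAL
with one byte left unconsumed, which is the kind of signal a caller
could use to detect a dangling lead byte at the end of a field.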


