Thread: Re: Fields ending in combining diacritics




Re: Fields ending in combining diacritics
United States
2007-03-12 12:15:41
Adam,
The specific case I am dealing with is a 670 field containing only 1
subfield a that ends with text like:  houses built above
50<degree mark><field mark>.  Obviously, the providing library has used
the wrong character for the degree mark.  They used 0xea, which is a
combining Angstrom diacritic, when they should have used 0xc0, the
degree mark.  The initial records are in MARC8 encoding.  When I run the
translation for this, I end up with no errors, but the diacritic
character now follows the field mark.  What I am interested in is a way
for the siconv library to catch this situation, since applying a
diacritic to a control character should not be allowed behavior.
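
A caller-side check on the converted output is one way to notice this case. The sketch below is not part of YAZ; the function name is hypothetical. It scans UTF-8 output for a combining mark in the U+0300..U+036F block that directly follows a control character (or starts the buffer), which is the 0x1E 0xCC 0x8A pattern discussed in this thread. It assumes the output is well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a YAZ function): return the byte offset of the
 * first combining mark (U+0300..U+036F) that has no base character in
 * front of it, i.e. it starts the buffer or follows a control character
 * below 0x20. Return -1 if no such mark is found.
 */
long find_orphan_combining_utf8(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i + 1 < len; i++) {
        /* U+0300..U+033F encode as CC 80..BF, U+0340..U+036F as CD 80..AF */
        int combining =
            (s[i] == 0xCC && s[i + 1] >= 0x80 && s[i + 1] <= 0xBF) ||
            (s[i] == 0xCD && s[i + 1] >= 0x80 && s[i + 1] <= 0xAF);
        if (combining && (i == 0 || s[i - 1] < 0x20))
            return (long)i;
    }
    return -1;
}
```

Called on the converted field, a non-negative result marks where the orphaned diacritic sits, so the record can be logged or rejected.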

In this vein, maybe you can enlighten me on a question - I am running a
tag by tag conversion on records that I process.  Originally, I was not
including the field mark character in the string sent to siconv.  I
found, however, that there were some cases where the conversion state
was left indeterminate, so I began to include the field mark in the
input string.  That seems to have fixed all of the other problems except
this one.  What is your recommended practice for converting records?
Should I be including the field mark or not?

Gary
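
The same condition can also be caught before conversion, on the MARC8 bytes themselves. This is a minimal sketch under stated assumptions: the function names are hypothetical, and it treats every byte in 0xE0..0xFE as a combining character, which holds for the default ANSEL set but not after an escape sequence switches to another character set. In MARC8 a combining character precedes its base character, so a run of combining bytes followed by a control character or by the end of the data has nothing to attach to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: default ANSEL set, where 0xE0..0xFE are combining characters. */
static bool is_marc8_combining(unsigned char c)
{
    return c >= 0xE0 && c <= 0xFE;
}

/*
 * Hypothetical helper: return the offset of the first combining character
 * that is followed (possibly after more combining characters) by a control
 * character or by the end of the data. Return -1 if none is found.
 */
long find_dangling_diacritic(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (!is_marc8_combining(data[i]))
            continue;
        size_t j = i;
        while (j < len && is_marc8_combining(data[j]))
            j++;                      /* skip stacked diacritics */
        if (j == len || data[j] < 0x20)
            return (long)i;           /* no base character follows */
        i = j;                        /* data[j] is the base character */
    }
    return -1;
}
```

The check gives the same answer whether or not the field mark (0x1E) is included in the buffer, since both the end of data and a control byte count as "no base character".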

Adam Dickmeiss wrote:

> Adam Dickmeiss wrote:
>
>> Gary Anderson wrote:
>>
>>> I am not sure how this will help.  In the application, the last 2
>>> bytes of the data string are 0xea and 0x1e - the diacritic and the
>>> record mark.  yaz_iconv seems to drop the diacritic because it
>>> doesn't have a trailing character, but it does process the record
>>> mark.  What I need is something that will tell me that this case has
>>> occurred.  It looks to me like yaz just drops the diacritic.
>>
>> I don't see a way the iconv interface could tell you this. I'm still
>> a little confused, so forgive me for asking: what is the behavior
>> you want? (keep the diacritic?)
>>
>
> In case you want *not* to keep the diacritic, in other words you are
> asking to be notified about an error, then maybe it's best to use
> EINVAL because the iconv man page says:
>
> "EINVAL An incomplete multibyte sequence has been encountered in
> the input."
>
> Case 1:
> So if you pass
>    .. 0xEA
> you get EINVAL because no characters follow 0xEA (as far as iconv is
> concerned).
>
> Case 2: If you pass
>
>    .. 0xEA 0x1E
> that would not return an error. In fact YAZ currently converts this
> to UTF-8:
>       0x1E 0xCC 0x8A
> because 0x1E is just a "character".
>
> Unfortunately for case 1, YAZ currently returns 'unknown error'.
> That's no good. This has been fixed in the CVS version of YAZ.
>
> / Adam
>
>
>> / Adam
>>
>>>
>>> My checking indicates that on completion of conversion of the record
>>> mark, the yaz_iconv library is left in its 'initial state'.  The
>>> next string converts just fine.
>>> Gary
>>>
>>> Adam Dickmeiss wrote:
>>>
>>>> Gary Anderson wrote:
>>>>
>>>>> I am using the siconv interface.  I have a programmatic process
>>>>> that deals with very large files of records.
>>>>>
>>>>> Adam Dickmeiss wrote:
>>>>>
>>>>>> Gary Anderson wrote:
>>>>>>
>>>>>>> I recently ran some tests using records from the National
>>>>>>> Library of Canada.  Of the 600,000+ records in their name and
>>>>>>> subject authority file, six records had 670 tags where the
>>>>>>> subfield a data ended in a combining diacritic character with no
>>>>>>> following character.
>>>>>>>
>>>>>>> Submitting that data string
>>>>>>> (indicators+subfieldmark+subfieldcode+data+fieldmark) to
>>>>>>> siconvert resulted in an output string that did not contain the
>>>>>>> diacritic character.  It was dropped.  The field mark character
>>>>>>> was retained.  Can you suggest a means for notifying the caller
>>>>>>> when this condition occurs?  Byte counts don't really work
>>>>>>> because UTF8 is one side or the other of the conversion
>>>>>>> transaction.
>>>>>>>
>>>>>>> The ending diacritic values were:  0xE2, 0xE5, 0xE8, 0xEA, and
>>>>>>> 0xF6.
>>>>>>
>>>>>>
>>>>
>>>> I think what you need to do is to "flush", i.e. reset to the
>>>> "initial state". The flush would take place after a field or
>>>> subfield ends.
>>>>
>>>> That's done by iconv and, hopefully, yaz_iconv by setting inbuf or
>>>> *inbuf to NULL, but outbuf to non-NULL, i.e.
>>>>
>>>> yaz_iconv(cd, 0, 0, &outbuf, &outbytesleft);
>>>>
>>>> From 'man 3 iconv':
>>>> "
>>>> A different case is when inbuf is NULL or *inbuf is NULL, but outbuf
>>>> is not NULL and *outbuf is not NULL. In this case, the iconv()
>>>> function attempts to set cd's conversion state to the initial state
>>>> and store a corresponding shift sequence at *outbuf. At most
>>>> *outbytesleft bytes, starting at *outbuf, will be written. If the
>>>> output buffer has no more room for this reset sequence, it sets
>>>> errno to E2BIG and returns (size_t)(-1). Otherwise it increments
>>>> *outbuf and decrements *outbytesleft by the number of bytes written.
>>>> "
>>>>
>>>> Use YAZ 2.1.48 or later for this to work.
>>>>
>>>> / Adam
>>>>
>>>>>>>
>>>>>> Did you use yaz-marcdump for the conversion?
>>>>>>
>>>>>> Or did you do something else (such as programming towards the
>>>>>> siconv interface)?
>>>>>>
>>>>>> / Adam
>>>>>>
>>>>>>> Thanks
>>>>>>> Gary
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Yazlist mailing list
>>>>>>> Yazlistlists.indexdata.dk
>>>>>>> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist



  
Re: Fields ending in combining diacritics
Denmark
2007-03-12 13:36:24
Gary Anderson wrote:
> Adam,
> The specific case I am dealing with is a 670 field containing only 1
> subfield a that ends with text like:  houses built above
> 50<degree mark><field mark>.  Obviously, the providing library has used
> the wrong character for the degree mark.  They used 0xea, which is a
> combining Angstrom diacritic, when they should have used 0xc0, the
> degree mark.  The initial records are in MARC8 encoding.  When I run
> the translation for this, I end up with no errors, but the diacritic
> character now follows the field mark.  What I am interested in is a
> way for the siconv library to catch this situation, since applying a
> diacritic to a control character should not be allowed behavior.
>
> In this vein, maybe you can enlighten me on a question - I am running a
> tag by tag conversion on records that I process.  Originally, I was not
> including the field mark character in the string sent to siconv.  I
> found, however, that there were some cases where the conversion state
> was left indeterminate, so I began to include the field mark in the
> input string.  That seems to have fixed all of the other problems
> except this one.  What is your recommended practice for converting
> records?  Should I be including the field mark or not?
> 
You should not include the field mark. IMHO, the MARC control chars are
out of the scope of iconv. Firstly, they should not be converted.
Secondly, they may do harm, in case "some" character set has special
meaning for them.

And as I said, you'll be getting an error. The iconv will now say
"incomplete converted sequence" and you'll have no chars left in the
subfield. So you know something is fishy.

It's been implemented in yaz-iconv, in case you need sample code to
look at.

For this, check out YAZ via CVS.

/ Adam
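
The pattern described above (convert the subfield data without the field mark, treat EINVAL as "input ended mid-character", then flush) might look roughly like the sketch below. It uses plain POSIX iconv rather than yaz_iconv, on the assumption, suggested by the thread, that yaz_iconv follows the same calling convention. The function name and return codes are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert one subfield's data (no field mark).
 * Returns 0 on success, 1 if the input ended in an incomplete sequence
 * (EINVAL), and -1 on any other conversion failure. *outlen receives
 * the number of bytes written to out.
 */
int convert_subfield(iconv_t cd, const char *in, size_t inlen,
                     char *out, size_t outcap, size_t *outlen)
{
    assert(in != NULL && out != NULL && outlen != NULL);
    char *inp = (char *)in;           /* iconv takes a non-const pointer */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;
    int status = 0;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        status = (errno == EINVAL) ? 1 : -1;

    /* Flush: a NULL input resets cd to its initial state and writes any
       pending shift sequence, so the next subfield starts clean. */
    if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1 && status == 0)
        status = -1;

    *outlen = (size_t)(outp - out);
    return status;
}
```

A return value of 1 is the "something is fishy" signal: the caller can log the record and decide whether to drop or repair the trailing byte. Stock iconv implementations generally have no MARC8 converter, so trying this outside YAZ needs another multibyte source encoding, for example UTF-8 input that ends in a truncated sequence.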



