Thread: Re: Fields ending in combining diacritics

2007-03-12 18:21:17
Adam Dickmeiss wrote:
OK.  I can see where the change in the most current siconv.c handles an
orphaned diacritic, but now I am back to the original reason I put the
field mark and subfield marks in the string I was converting, and that
is incorrect escape sequences.  The way the code is now, yaz-iconv
doesn't check for a closing escape sequence at the end of a string being
converted.  Although the MARC encoding standard specifies that opening
and closing escape sequences MUST be paired within a subfield, I have
seen records that leave off the 2nd escape sequence to shift the
character sets back to ASCII/ANSEL.  The current version of yaz-iconv
remains in the most recently set escape mode and allows subsequent
strings to be converted using that mode.  I understand the reason for
doing this, but is there a way I can get the library to tell me this
case exists without having to do a bunch of code on my own?
Gary
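
For the unpaired escape sequence case described above, a caller-side
pre-check might look roughly like the sketch below. It is not part of
YAZ; the function name marc8_g0_is_ascii_at_end is made up for
illustration, and the escape forms it recognises (ESC s/g/b/p, and
ESC [$] ( , ) - [!] F) are an assumption based on one reading of the
MARC-8 character set specification. It only tracks the G0 set, and
reports whether G0 is back at ASCII when the subfield data ends.

```c
#include <assert.h>
#include <stddef.h>

#define MARC8_ESC 0x1B

/* Returns 1 if G0 is ASCII when the buffer ends, 0 if another G0 set is
   still active or the buffer ends in the middle of an escape sequence.
   Hypothetical helper, not a YAZ function. */
int marc8_g0_is_ascii_at_end(const unsigned char *buf, size_t len)
{
    int g0_is_ascii = 1;
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        int multibyte = 0;
        int targets_g0 = 1;

        if (buf[i] != MARC8_ESC) {
            i++;
            continue;
        }
        if (++i >= len)
            return 0;                       /* ESC is the last byte */

        /* Single-byte forms: ESC s returns to ASCII; ESC g, ESC b and
           ESC p select Greek symbols, subscript and superscript. */
        if (buf[i] == 's') { g0_is_ascii = 1; i++; continue; }
        if (buf[i] == 'g' || buf[i] == 'b' || buf[i] == 'p') {
            g0_is_ascii = 0;
            i++;
            continue;
        }

        /* Designation forms: optional '$' (multibyte set), then '(' or
           ',' for G0, or ')' or '-' for G1, an optional '!', and the
           final byte naming the set. */
        if (buf[i] == '$') {
            multibyte = 1;
            if (++i >= len)
                return 0;
        }
        if (buf[i] == '(' || buf[i] == ',') {
            i++;
        } else if (buf[i] == ')' || buf[i] == '-') {
            targets_g0 = 0;
            i++;
        } else if (!multibyte) {
            continue;                       /* unrecognised: rescan byte */
        }
        if (i < len && buf[i] == '!')
            i++;
        if (i >= len)
            return 0;                       /* final byte is missing */
        if (targets_g0)
            g0_is_ascii = (!multibyte && buf[i] == 'B');
        i++;
    }
    return g0_is_ascii;
}
```

Calling this on each subfield before conversion would flag records
where the closing escape was left off, without putting the field or
subfield marks into the string being converted.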

> Gary Anderson wrote:
>
>> Adam,
>> The specific case I am dealing with is a 670 field containing only 1
>> subfield a that ends with text like:  houses built above 50<degree
>> mark><field mark>.  Obviously, the providing library has used the
>> wrong character for the degree mark.  They used 0xea which is a
>> combining Angstrom diacritic when they should have used 0xc0 - the
>> degree mark.  The initial records are in MARC8 encoding.  When I run
>> the translation for this, I end up with no errors, but the diacritic
>> character now follows the field mark.  What I am interested in is a
>> way for the siconv library to catch this situation, since applying a
>> diacritic to a control character should not be allowed behavior.
>> In this vein, maybe you can enlighten me on a question - I am running
>> a tag by tag conversion on records that I process.  Originally, I was
>> not including the field mark character in the string sent to siconv.
>> I found, however, that there were some cases where the conversion
>> state was left indeterminate, so I began to include the field mark in
>> the input string.  That seems to have fixed all of the other problems
>> except this one.  What is your recommended practice for converting
>> records?  Should I be including the field mark or not?
>>
> You should not include the field mark. IMHO, the MARC control chars
> are out of the scope of iconv. Firstly, they should not be converted.
> Secondly, they may harm - in case "some" character set has special
> meaning for them.
>
> And as I said. You'll be getting an error. The iconv will now say
> "incomplete converted sequence" and you'll have no chars left in the
> subfield. So you know something is fishy.
>
> It's been implemented by the yaz-iconv in case you need sample code to
> look at.
>
> For this check out YAZ via CVS.
>
> / Adam
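
If the field mark is kept out of the converter as suggested above, the
orphaned diacritic can also be caught before conversion. The sketch
below is a caller-side check, not a YAZ function: the name
ends_with_ansel_combining is hypothetical, and it assumes the data is
MARC-8 with ANSEL as the active extended set, where the combining
characters are taken to occupy 0xE0 through 0xFE (the five values
reported in this thread all fall in that range).

```c
#include <assert.h>
#include <stddef.h>

#define MARC_FIELD_TERMINATOR 0x1E

/* Returns 1 if the data, ignoring one trailing field terminator if
   present, ends in a byte from the assumed ANSEL combining range
   0xE0..0xFE; returns 0 otherwise. Hypothetical helper. */
int ends_with_ansel_combining(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);

    /* Drop the field terminator so it is never treated as a base
       character for the diacritic. */
    if (len > 0 && data[len - 1] == MARC_FIELD_TERMINATOR)
        len--;
    if (len == 0)
        return 0;
    return data[len - 1] >= 0xE0 && data[len - 1] <= 0xFE;
}
```

A check like this is only a heuristic: it would give false positives
for subfields that end inside a multibyte or other non-ANSEL set.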
>
>> Gary
>>
>> Adam Dickmeiss wrote:
>>
>>> Adam Dickmeiss wrote:
>>>
>>>> Gary Anderson wrote:
>>>>
>>>>> I am not sure how this will help.  In the application, the last 2
>>>>> bytes of the data string are 0xea and 0x1e - the diacritic and the
>>>>> record mark.  yaz_iconv seems to drop the diacritic because it
>>>>> doesn't have a trailing character, but it does process the record
>>>>> mark.  What I need is something that will tell me that this case
>>>>> has occurred.  It looks to me like yaz just drops the diacritic.
>>>>
>>>>
>>>> I don't see a way the iconv interface could tell you this. I'm
>>>> still a little confused, so forgive me for asking,.. what is the
>>>> behavior you want? (keep the diacritic?)
>>>>
>>>
>>> In case you want *not* to keep the diacritic, in other words you are
>>> asking to be notified about an error .. then maybe it's best to use
>>> EINVAL because the iconv man page says:
>>>
>>> "EINVAL An  incomplete  multibyte  sequence  has been encountered in
>>> the input."
>>>
>>> Case 1:
>>> So if you pass
>>>    .. 0xEA
>>> you get EINVAL because no characters follow 0xEA (as far as iconv is
>>> concerned).
>>>
>>> Case 2: If you pass
>>>
>>>    .. 0xEA 0x1E
>>> that would not return an error. In fact YAZ currently converts this
>>> to UTF-8:
>>>       0x1E 0xCC 0x8A
>>> because 0x1E is just a "character".
>>>
>>> Unfortunately for case 1, YAZ currently returns 'unknown error'.
>>> That's no good. This has been fixed in the CVS version of YAZ.
>>>
>>> / Adam
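
The EINVAL convention from case 1 can be tried out with the system
iconv. MARC-8 is usually not among the system converter's charsets, so
the sketch below uses a truncated UTF-8 sequence as a stand-in for the
dangling 0xEA; that substitution, the helper name convert_status and
the choice of UTF-16LE as the target are all assumptions made for
illustration. The point is only the error-handling shape: a return of
(size_t)-1 with errno set to EINVAL means the input ended in the middle
of a character.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Converts the input from UTF-8 to UTF-16LE and returns 0 on success,
   or the errno value reported by iconv_open() or iconv() on failure.
   Hypothetical helper for illustration. */
int convert_status(const char *in, size_t inlen)
{
    char out[64];
    char *inp = (char *) in;    /* iconv() takes char **, input is not modified */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = sizeof(out);
    int status = 0;
    iconv_t cd;

    assert(in != NULL);
    cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t) -1)
        return errno;

    /* Save errno right away; iconv_close() may change it. */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        status = errno;

    iconv_close(cd);
    return status;
}
```

A caller would compare the result against EINVAL to tell "input stopped
mid-character" apart from other failures such as EILSEQ or E2BIG.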
>>>
>>>
>>>> / Adam
>>>>
>>>>>
>>>>> My checking indicates that on completion of conversion of the
>>>>> record mark, the yaz_iconv library is left in its 'initial
>>>>> state'.  The next string converts just fine.
>>>>> Gary
>>>>>
>>>>> Adam Dickmeiss wrote:
>>>>>
>>>>>> Gary Anderson wrote:
>>>>>>
>>>>>>> I am using the siconv interface.  I have a programmatic process
>>>>>>> that deals with very large files of records.
>>>>>>>
>>>>>>> Adam Dickmeiss wrote:
>>>>>>>
>>>>>>>> Gary Anderson wrote:
>>>>>>>>
>>>>>>>>> I recently ran some tests using records from the National
>>>>>>>>> Library of Canada.  Of the 600,000+ records in their name and
>>>>>>>>> subject authority file, six records had 670 tags where the
>>>>>>>>> subfield a data ended in a combining diacritic character with
>>>>>>>>> no following character.
>>>>>>>>>
>>>>>>>>> Submitting that data string
>>>>>>>>> (indicators+subfieldmark+subfieldcode+data+fieldmark) to
>>>>>>>>> siconvert resulted in an output string that did not contain
>>>>>>>>> the diacritic character.  It was dropped.  The field mark
>>>>>>>>> character was retained.  Can you suggest a means for notifying
>>>>>>>>> the caller when this condition occurs?  Byte counts don't
>>>>>>>>> really work because UTF8 is one side or the other of the
>>>>>>>>> conversion transaction.
>>>>>>>>>
>>>>>>>>> The ending diacritic values were:  0xE2, 0xE5, 0xE8, 0xEA, and
>>>>>>>>> 0xF6.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> I think what you need to do is to "flush", i.e. reset to the
>>>>>> "initial state". The flush would take place after a field or
>>>>>> subfield ends.
>>>>>>
>>>>>> That's done by iconv and, hopefully, yaz_iconv by setting inbuf
>>>>>> or *inbuf to NULL, but outbuf to non-NULL, i.e.
>>>>>>
>>>>>> yaz_iconv(cd, 0, 0, &outbuf, &outbytesleft);
>>>>>>
>>>>>> From 'man 3 iconv':
>>>>>> "
>>>>>> A different case is when inbuf is NULL or *inbuf is NULL, but
>>>>>> outbuf is not NULL and *outbuf is not NULL. In this case, the
>>>>>> iconv() function attempts to set cd's conversion state to the
>>>>>> initial state and store a corresponding shift sequence at
>>>>>> *outbuf. At most *outbytesleft bytes, starting at *outbuf, will
>>>>>> be written. If the output buffer has no more room for this reset
>>>>>> sequence, it sets errno to E2BIG and returns (size_t)(-1).
>>>>>> Otherwise it increments *outbuf and decrements *outbytesleft by
>>>>>> the number of bytes written.
>>>>>> "
>>>>>>
>>>>>> Use YAZ 2.1.48 or later for this to work.
>>>>>>
>>>>>> / Adam
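
The flush-per-subfield pattern described above could be wrapped up
roughly as follows. This sketch uses the system iconv() rather than
yaz_iconv() so that it is self-contained; according to the message
above the yaz_iconv() call takes the same argument shape. The helper
name convert_and_flush and its return convention are assumptions made
for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Converts one subfield's bytes with cd, then flushes cd back to its
   initial state. Returns the number of bytes written to out, or -1
   with errno set by iconv(). Hypothetical helper. */
long convert_and_flush(iconv_t cd, const char *in, size_t inlen,
                       char *out, size_t outsize)
{
    char *inp = (char *) in;    /* iconv() takes char **, input is not modified */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outsize;

    assert(in != NULL && out != NULL);

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1) {
        int saved = errno;
        /* inbuf and outbuf both NULL: reset the state without writing
           anything, so the next subfield starts clean. */
        iconv(cd, NULL, NULL, NULL, NULL);
        errno = saved;
        return -1;
    }

    /* inbuf NULL, outbuf non-NULL: write any pending shift sequence
       and return cd to its initial state. */
    if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t) -1)
        return -1;

    return (long) (outp - out);
}
```

Because the descriptor is back in its initial state after every call,
one descriptor can be reused across all subfields of a large file, and
an error in one subfield does not leak shift state into the next.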
>>>>>>
>>>>>>>>>
>>>>>>>> Did you use yaz-marcdump for the conversion?
>>>>>>>>
>>>>>>>> Or did you do something else ? (such as programming towards the
>>>>>>>> siconv interface)?
>>>>>>>>
>>>>>>>> / Adam
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Gary
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Yazlist mailing list
>>>>>>>>> Yazlistlists.indexdata.dk
>>>>>>>>> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist

