Adam Dickmeiss wrote:
OK. I can see where the change in the most current siconv.c handles an
orphaned diacritic, but now I am back to the original reason I put the
field mark and subfield marks in the string I was converting, and that
is incorrect escape sequences. The way the code is now, yaz-iconv
doesn't check for a closing escape sequence at the end of a string being
converted. Although the MARC encoding standard specifies that opening
and closing escape sequences MUST be paired within a subfield, I have
seen records that leave off the 2nd escape sequence to shift the
character sets back to ASCII/ANSEL. The current version of yaz-iconv
remains in the most recently set escape mode and allows subsequent
strings to be converted using that mode. I understand the reason for
doing this, but is there a way I can get the library to tell me this
case exists without having to do a bunch of code on my own?
Gary
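
For reference, the unclosed-escape case described above can be detected
outside the converter by scanning the raw MARC-8 bytes before they are
handed to yaz_iconv. What follows is only a minimal sketch, not YAZ code.
The names marc8_scan and marc8_ends_in_default are hypothetical, and the
escape layout is an assumption based on one reading of the MARC-8
specification: ESC s/g/b/p for the single-letter sets, ESC plus an
intermediate of ( , ) - or $ followed by a final byte otherwise, with
defaults G0 = 'B' (ASCII) and G1 = 'E' (ANSEL).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of YAZ: the final bytes of the sets a
   MARC-8 string leaves designated as G0 and G1. */
struct marc8_sets { unsigned char g0; unsigned char g1; };

static struct marc8_sets marc8_scan(const unsigned char *s, size_t n)
{
    struct marc8_sets st = { 'B', 'E' };  /* assumed defaults: ASCII, ANSEL */
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        if (s[i] != 0x1B) { i++; continue; }
        i++;                              /* skip ESC */
        if (i >= n) break;                /* truncated escape: stop */
        unsigned char c = s[i];
        if (c == 's') { st.g0 = 'B'; i++; continue; }   /* back to ASCII */
        if (c == 'g' || c == 'b' || c == 'p') {         /* Greek symbols,
                                                           subscript, superscript */
            st.g0 = c; i++; continue;
        }
        bool to_g1 = false;
        if (c == '$') {                   /* multibyte set */
            i++;
            if (i < n && (s[i] == ')' || s[i] == '-')) { to_g1 = true; i++; }
            else if (i < n && s[i] == ',') i++;
        } else if (c == '(' || c == ',') {
            i++;                          /* single-byte set into G0 */
        } else if (c == ')' || c == '-') {
            to_g1 = true; i++;            /* single-byte set into G1 */
        } else {
            continue;                     /* unrecognised sequence: ignore */
        }
        if (i < n && s[i] == '!') i++;    /* extra intermediate seen with ANSEL */
        if (i >= n) break;
        if (to_g1) st.g1 = s[i]; else st.g0 = s[i];
        i++;
    }
    return st;
}

/* True when the string ends with both sets back at the assumed defaults. */
static bool marc8_ends_in_default(const unsigned char *s, size_t n)
{
    struct marc8_sets st = marc8_scan(s, n);
    return st.g0 == 'B' && st.g1 == 'E';
}
```

Calling marc8_ends_in_default() on each subfield before conversion would
flag records that shift to another set and never shift back. The caller
can then reject the record, or reset the converter before the next
string.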
> Gary Anderson wrote:
>
>> Adam,
>> The specific case I am dealing with is a 670 field containing only 1
>> subfield a that ends with text like: houses built above 50<degree
>> mark><field mark>. Obviously, the providing library has used the
>> wrong character for the degree mark. They used 0xea which is a
>> combining Angstrom diacritic when they should have used 0xc0 - the
>> degree mark. The initial records are in MARC8 encoding. When I run
>> the translation for this, I end up with no errors, but the diacritic
>> character now follows the field mark. What I am interested in is a
>> way for the siconv library to catch this situation, since applying a
>> diacritic to a control character should not be allowed behavior.
>> In this vein, maybe you can enlighten me on a question - I am running
>> a tag by tag conversion on records that I process. Originally, I was
>> not including the field mark character in the string sent to siconv.
>> I found, however, that there were some cases where the conversion
>> state was left indeterminate, so I began to include the field mark in
>> the input string. That seems to have fixed all of the other problems
>> except this one. What is your recommended practice for converting
>> records? Should I be including the field mark or not?
>>
> You should not include the field mark. IMHO, the MARC control chars
> are out of the scope of iconv. Firstly, they should not be converted.
> Secondly, they may do harm, in case "some" character set has special
> meaning for them.
>
> And as I said, you'll be getting an error. iconv will now say
> "incomplete converted sequence" and you'll have no chars left in the
> subfield. So you know something is fishy.
>
> It's been implemented in yaz-iconv, in case you need sample code to
> look at.
>
> For this check out YAZ via CVS.
>
> / Adam
>
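
The advice above (keep the field mark out of the converted string) can
be combined with a cheap pre-check for the orphaned diacritic. This is
only a sketch under stated assumptions. Both function names are
hypothetical, not part of YAZ. It assumes the data is single-byte MARC-8
with ANSEL in G1, where the combining characters occupy 0xE0 to 0xFE,
and it treats 0x1E (field mark) and 0x1D (record mark) as terminators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: length of the data once trailing MARC field
   (0x1E) and record (0x1D) terminators are left out. */
static size_t marc_strip_terminators(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    while (n > 0 && (s[n - 1] == 0x1E || s[n - 1] == 0x1D))
        n--;
    return n;
}

/* Hypothetical helper: true when the last data byte is an ANSEL
   combining character (assumed range 0xE0..0xFE) with nothing after it
   to combine with. */
static bool ansel_orphan_diacritic(const unsigned char *s, size_t n)
{
    n = marc_strip_terminators(s, n);
    return n > 0 && s[n - 1] >= 0xE0 && s[n - 1] <= 0xFE;
}
```

With the bytes from this thread, "50" 0xEA 0x1E is reported as an
orphan, while "50" 0xC0 0x1E (the intended degree sign) is not. The
length returned by marc_strip_terminators() is what would be passed to
the converter.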
>> Gary
>>
>> Adam Dickmeiss wrote:
>>
>>> Adam Dickmeiss wrote:
>>>
>>>> Gary Anderson wrote:
>>>>
>>>>> I am not sure how this will help. In the application, the last 2
>>>>> bytes of the data string are 0xea and 0x1e - the diacritic and the
>>>>> record mark. yaz_iconv seems to drop the diacritic because it
>>>>> doesn't have a trailing character, but it does process the record
>>>>> mark. What I need is something that will tell me that this case
>>>>> has occurred. It looks to me like yaz just drops the diacritic.
>>>>
>>>>
>>>> I don't see a way the iconv interface could tell you this. I'm
>>>> still a little confused, so forgive me for asking... what is the
>>>> behavior you want? (keep the diacritic?)
>>>>
>>>
>>> In case you want *not* to keep the diacritic, in other words you are
>>> asking to be notified about an error, then maybe it's best to use
>>> EINVAL because the iconv man page says:
>>>
>>> "EINVAL An incomplete multibyte sequence has been encountered in
>>> the input."
>>>
>>> Case 1: If you pass
>>>
>>> .. 0xEA
>>> you get EINVAL because no characters follow 0xEA (as far as iconv is
>>> concerned).
>>>
>>> Case 2: If you pass
>>>
>>> .. 0xEA 0x1E
>>> that would not return an error. In fact YAZ currently converts this
>>> to UTF-8:
>>> 0x1E 0xCC 0x8A
>>> because 0x1E is just a "character".
>>>
>>> Unfortunately for case 1, YAZ currently returns 'unknown error'.
>>> That's no good. This has been fixed in the CVS version of YAZ.
>>>
>>> / Adam
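
The EINVAL behaviour in case 1 can be tried with the C library's own
iconv(), which has the calling convention that yaz_iconv is described as
following in this thread. This is a sketch, not YAZ code. MARC-8 is not
available in libc, so a truncated UTF-8 sequence stands in for the
dangling 0xEA. The helper name convert_status and the UTF-8 to UCS-4LE
pairing are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert one buffer and report how the call
   ended. Returns 0 on success, otherwise the errno set by iconv():
   EINVAL (input stops inside a sequence), EILSEQ (invalid input) or
   E2BIG (output buffer full). *outlen receives the bytes written. */
static int convert_status(iconv_t cd, const char *in, size_t inlen,
                          char *out, size_t outcap, size_t *outlen)
{
    char *inp = (char *)in;   /* iconv() wants char **, the data is only read */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    assert(cd != (iconv_t)-1);
    errno = 0;
    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    int status = (r == (size_t)-1) ? errno : 0;

    *outlen = outcap - outleft;
    return status;
}
```

A caller that sees EINVAL at the end of a subfield knows the input
stopped in the middle of a character, which is the signal being asked
for here. Output produced before the incomplete sequence is still
reported through *outlen.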
>>>
>>>
>>>> / Adam
>>>>
>>>>>
>>>>> My checking indicates that on completion of conversion of the
>>>>> record mark, the yaz_iconv library is left in its 'initial
>>>>> state'. The next string converts just fine.
>>>>> Gary
>>>>>
>>>>> Adam Dickmeiss wrote:
>>>>>
>>>>>> Gary Anderson wrote:
>>>>>>
>>>>>>> I am using the siconv interface. I have a programmatic process
>>>>>>> that deals with very large files of records.
>>>>>>>
>>>>>>> Adam Dickmeiss wrote:
>>>>>>>
>>>>>>>> Gary Anderson wrote:
>>>>>>>>
>>>>>>>>> I recently ran some tests using records from the National
>>>>>>>>> Library of Canada. Of the 600,000+ records in their name and
>>>>>>>>> subject authority file, six records had 670 tags where the
>>>>>>>>> subfield a data ended in a combining diacritic character with
>>>>>>>>> no following character.
>>>>>>>>>
>>>>>>>>> Submitting that data string
>>>>>>>>> (indicators+subfieldmark+subfieldcode+data+fieldmark) to
>>>>>>>>> siconvert resulted in an output string that did not contain
>>>>>>>>> the diacritic character. It was dropped. The field mark
>>>>>>>>> character was retained. Can you suggest a means for notifying
>>>>>>>>> the caller when this condition occurs? Byte counts don't
>>>>>>>>> really work because UTF8 is one side or the other of the
>>>>>>>>> conversion transaction.
>>>>>>>>>
>>>>>>>>> The ending diacritic values were: 0xE2, 0xE5, 0xE8, 0xEA, and
>>>>>>>>> 0xF6.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> I think what you need to do is to "flush", i.e. reset to the
>>>>>> "initial state". The flush would take place after a field or
>>>>>> subfield ends.
>>>>>>
>>>>>> That's done by iconv and, hopefully, yaz_iconv by setting inbuf
>>>>>> or *inbuf to NULL, but outbuf to non-NULL, i.e.
>>>>>>
>>>>>> yaz_iconv(cd, 0, 0, &outbuf, &outbytesleft);
>>>>>>
>>>>>> From 'man 3 iconv':
>>>>>> "
>>>>>> A different case is when inbuf is NULL or *inbuf is NULL, but
>>>>>> outbuf is not NULL and *outbuf is not NULL. In this case, the
>>>>>> iconv() function attempts to set cd's conversion state to the
>>>>>> initial state and store a corresponding shift sequence at
>>>>>> *outbuf. At most *outbytesleft bytes, starting at *outbuf, will
>>>>>> be written. If the output buffer has no more room for this reset
>>>>>> sequence, it sets errno to E2BIG and returns (size_t)(-1).
>>>>>> Otherwise it increments *outbuf and decrements *outbytesleft by
>>>>>> the number of bytes written.
>>>>>> "
>>>>>>
>>>>>> Use YAZ 2.1.48 or later for this to work.
>>>>>>
>>>>>> / Adam
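
The per-field flush described above might be wrapped as follows. This is
a sketch using the C library's iconv() rather than yaz_iconv, since the
quoted call shows the same argument shape. The helper name convert_field
and the UTF-8 to UCS-4LE pairing are assumptions for illustration only.
With a stateless target such as UCS-4LE the reset call writes nothing,
but the call pattern is the point.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert one field, then always put cd back into
   its initial state so nothing leaks into the next field. Returns 0, or
   the errno of the first step that failed. */
static int convert_field(iconv_t cd, const char *in, size_t inlen,
                         char *out, size_t outcap, size_t *outlen)
{
    char *inp = (char *)in;   /* iconv() wants char **, the data is only read */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;
    int status = 0;

    assert(cd != (iconv_t)-1);
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        status = errno;

    /* inbuf NULL, outbuf non-NULL: emit any pending shift sequence and
       reset the conversion state (the "flush" from the man page). */
    if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1 && status == 0)
        status = errno;

    *outlen = outcap - outleft;
    return status;
}
```

Because the reset runs even after a failed conversion, a bad field
reports its error and the next field still starts from the initial
state.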
>>>>>>
>>>>>>>>>
>>>>>>>> Did you use yaz-marcdump for the conversion?
>>>>>>>>
>>>>>>>> Or did you do something else? (such as programming towards the
>>>>>>>> siconv interface)
>>>>>>>>
>>>>>>>> / Adam
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Gary
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Yazlist mailing list
>>>>>>>>> Yazlist lists.indexdata.dk
>>>>>>>>> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist