|
Thread: Fields ending in combining diacritics
|
|
| Fields ending in combining diacritics |
  United States |
2007-03-07 14:05:15 |
I recently ran some tests using records from the National Library of
Canada. Of the 600,000+ records in their name and subject authority
file, six records had 670 tags where the subfield a data ended in a
combining diacritic character with no following character.

Submitting that data string
(indicators+subfieldmark+subfieldcode+data+fieldmark) to siconvert
resulted in an output string that did not contain the diacritic
character. It was dropped. The field mark character was retained. Can
you suggest a means for notifying the caller when this condition
occurs? Byte counts don't really work because UTF-8 is on one side or
the other of the conversion transaction.

The ending diacritic values were: 0xE2, 0xE5, 0xE8, 0xEA, and 0xF6.

Thanks
Gary
_______________________________________________
Yazlist mailing list
Yazlist lists.indexdata.dk
http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist
|
|
|
| Re: Fields ending in combining diacritics |
  Denmark |
2007-03-08 04:05:54 |
Did you use yaz-marcdump for the conversion?

Or did you do something else (such as programming against the siconv
interface)?

/ Adam
|
|
| Re: Fields ending in combining diacritics |
  United States |
2007-03-08 13:44:06 |
I am using the siconv interface. I have a programmatic process that
deals with very large files of records.
|
|
|
| Re: Fields ending in combining diacritics |
  Denmark |
2007-03-08 16:07:29 |
I think what you need to do is "flush", i.e. reset to the "initial
state". The flush would take place after a field or subfield ends.

That's done by iconv and, hopefully, yaz_iconv by setting inbuf or
*inbuf to NULL, but outbuf to non-NULL, i.e.

  yaz_iconv(cd, 0, 0, &outbuf, &outbytesleft);

From 'man 3 iconv':

"
A different case is when inbuf is NULL or *inbuf is NULL, but outbuf is
not NULL and *outbuf is not NULL. In this case, the iconv() function
attempts to set cd's conversion state to the initial state and store a
corresponding shift sequence at *outbuf. At most *outbytesleft bytes,
starting at *outbuf, will be written. If the output buffer has no more
room for this reset sequence, it sets errno to E2BIG and returns
(size_t)(-1). Otherwise it increments *outbuf and decrements
*outbytesleft by the number of bytes written.
"

Use YAZ 2.1.48 or later for this to work.

/ Adam
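
As an illustration of the flush described above, here is a minimal
sketch using the POSIX iconv() call rather than yaz_iconv itself. It
assumes yaz_iconv accepts the same argument pattern, as the call in the
message suggests. flush_converter and convert_field are hypothetical
helper names, not part of YAZ or libc.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: ask the converter to return to its initial state
 * and write any pending shift sequence to *outbuf.
 * Returns 0 on success, -1 on failure (errno is E2BIG if the output
 * buffer has no room for the reset sequence). */
static int flush_converter(iconv_t cd, char **outbuf, size_t *outbytesleft)
{
    assert(outbuf != NULL && *outbuf != NULL);
    /* NULL inbuf with a non-NULL outbuf is the "flush" form of the call. */
    size_t r = iconv(cd, NULL, NULL, outbuf, outbytesleft);
    return r == (size_t)(-1) ? -1 : 0;
}

/* Hypothetical helper: convert one field or subfield, then flush so that
 * no converter state carries over into the next one.
 * Returns 0 on success, -1 if conversion or the flush fails. */
static int convert_field(iconv_t cd, char *in, size_t inlen,
                         char **outbuf, size_t *outbytesleft)
{
    char *p = in;
    size_t left = inlen;

    while (left > 0) {
        if (iconv(cd, &p, &left, outbuf, outbytesleft) == (size_t)(-1))
            return -1;
    }
    return flush_converter(cd, outbuf, outbytesleft);
}
```

The caller would invoke convert_field once per field or subfield and
treat a -1 return as a signal to inspect errno.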
|
|
| Re: Fields ending in combining diacritics |
  United States |
2007-03-08 16:25:51 |
I am not sure how this will help. In the application, the last 2 bytes
of the data string are 0xEA and 0x1E - the diacritic and the record
mark. yaz_iconv seems to drop the diacritic because it doesn't have a
trailing character, but it does process the record mark. What I need is
something that will tell me that this case has occurred. It looks to me
like yaz just drops the diacritic.

My checking indicates that on completion of conversion of the record
mark, the yaz_iconv library is left in its 'initial state'. The next
string converts just fine.

Gary
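
One way to get the notification asked for here is to check the input
before handing it to the converter. The sketch below is only an
illustration and is not part of YAZ. It assumes the input is MARC-8 in
the default ANSEL set, where combining diacritics occupy 0xE0..0xFE and
precede their base character, and it does not track escape sequences to
other character sets. find_dangling_diacritic and the macros are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* ISO 2709 / MARC structural bytes. */
#define MARC_RECORD_TERMINATOR  0x1D
#define MARC_FIELD_TERMINATOR   0x1E
#define MARC_SUBFIELD_DELIMITER 0x1F

/* Assumption: default MARC-8 (ANSEL) combining range is 0xE0..0xFE. */
static int is_marc8_combining(unsigned char c)
{
    return c >= 0xE0 && c <= 0xFE;
}

static int is_marc_delimiter(unsigned char c)
{
    return c == MARC_RECORD_TERMINATOR ||
           c == MARC_FIELD_TERMINATOR ||
           c == MARC_SUBFIELD_DELIMITER;
}

/* Returns the offset of the first combining byte that is directly
 * followed by a delimiter or by the end of the buffer, i.e. a diacritic
 * with no base character to attach to. Returns -1 if there is none. */
static long find_dangling_diacritic(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (!is_marc8_combining(data[i]))
            continue;
        if (i + 1 == len || is_marc_delimiter(data[i + 1]))
            return (long)i;
    }
    return -1;
}
```

With a string like the one in this thread, whose last two bytes are
0xEA 0x1E, the function returns the offset of the 0xEA byte, so the
caller can log or repair the field before converting it.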
|
|
|
|
|