Thread: Re: Fields ending in combining diacritics

Re: Fields ending in combining diacritics | United States | 2007-03-12 12:15:41
Adam,

The specific case I am dealing with is a 670 field containing only one
subfield a that ends with text like: houses built above
50<degree mark><field mark>. Obviously, the providing library has used
the wrong character for the degree mark. They used 0xea, which is a
combining Angstrom diacritic, when they should have used 0xc0, the
degree mark.

The initial records are in MARC8 encoding. When I run the translation
for this, I end up with no errors, but the diacritic character now
follows the field mark. What I am interested in is a way for the siconv
library to catch this situation, since applying a diacritic to a control
character should not be allowed behavior.

In this vein, maybe you can enlighten me on a question: I am running a
tag by tag conversion on records that I process. Originally, I was not
including the field mark character in the string sent to siconv. I
found, however, that there were some cases where the conversion state
was left indeterminate, so I began to include the field mark in the
input string. That seems to have fixed all of the other problems except
this one. What is your recommended practice for converting records?
Should I be including the field mark or not?

Gary
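
The condition described in this message can also be checked on the
caller's side before any conversion takes place. The following is only a
minimal sketch, not part of YAZ: the helper name
marc8_has_dangling_diacritic is hypothetical, and it assumes the default
ANSEL set is active for the high bytes, in which 0xE0 through 0xFE are
combining characters that precede their base character. Fields that
switch character sets with escape sequences would need more care.

```c
#include <assert.h>
#include <stddef.h>

#define MARC_FIELD_TERMINATOR 0x1E

/* Assumption: ANSEL is the active G1 set, so bytes 0xE0..0xFE are
   combining characters that must be followed by a base character. */
static int is_ansel_combining(unsigned char c)
{
    return c >= 0xE0 && c <= 0xFE;
}

/* Hypothetical helper: returns 1 if the field data ends in a combining
   byte with no base character after it, 0 otherwise. One trailing field
   terminator (0x1E), if present, is ignored. */
int marc8_has_dangling_diacritic(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len > 0 && buf[len - 1] == MARC_FIELD_TERMINATOR)
        len--;

    return len > 0 && is_ansel_combining(buf[len - 1]);
}
```

With the bytes from this thread, data ending in 0xEA 0x1E is flagged,
while the same data with 0xC0 (the degree mark) in place of 0xEA is not.
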
Adam Dickmeiss wrote:
> Adam Dickmeiss wrote:
>
>> Gary Anderson wrote:
>>
>>> I am not sure how this will help. In the application, the last 2
>>> bytes of the data string are 0xea and 0x1e - the diacritic and the
>>> record mark. yaz_iconv seems to drop the diacritic because it
>>> doesn't have a trailing character, but it does process the record
>>> mark. What I need is something that will tell me that this case has
>>> occurred. It looks to me like yaz just drops the diacritic.
>>
>> I don't see a way the iconv interface could tell you this. I'm still
>> a little confused, so forgive me for asking: what is the behavior
>> you want? (keep the diacritic?)
>>
>
> In case you want *not* to keep the diacritic, in other words you are
> asking to be notified about an error, then maybe it's best to use
> EINVAL because the iconv man page says:
>
> "EINVAL An incomplete multibyte sequence has been encountered in
> the input."
>
> Case 1: If you pass
>
> .. 0xEA
>
> you get EINVAL because no characters follow 0xEA (as far as iconv is
> concerned).
>
> Case 2: If you pass
>
> .. 0xEA 0x1E
>
> that would not return an error. In fact YAZ currently converts this
> to UTF-8:
>
> 0x1E 0xCC 0x8A
>
> because 0x1E is just a "character".
>
> Unfortunately for case 1, YAZ currently returns 'unknown error'.
> That's no good. This has been fixed in the CVS version of YAZ.
>
> / Adam
>
>
>> / Adam
>>
>>>
>>> My checking indicates that on completion of conversion of the record
>>> mark, the yaz_iconv library is left in its 'initial state'. The
>>> next string converts just fine.
>>> Gary
>>>
>>> Adam Dickmeiss wrote:
>>>
>>>> Gary Anderson wrote:
>>>>
>>>>> I am using the siconv interface. I have a programmatic process
>>>>> that deals with very large files of records.
>>>>>
>>>>> Adam Dickmeiss wrote:
>>>>>
>>>>>> Gary Anderson wrote:
>>>>>>
>>>>>>> I recently ran some tests using records from the National
>>>>>>> Library of Canada. Of the 600,000+ records in their name and
>>>>>>> subject authority file, six records had 670 tags where the
>>>>>>> subfield a data ended in a combining diacritic character with
>>>>>>> no following character.
>>>>>>>
>>>>>>> Submitting that data string
>>>>>>> (indicators+subfieldmark+subfieldcode+data+fieldmark) to
>>>>>>> siconvert resulted in an output string that did not contain the
>>>>>>> diacritic character. It was dropped. The field mark character
>>>>>>> was retained. Can you suggest a means for notifying the caller
>>>>>>> when this condition occurs? Byte counts don't really work
>>>>>>> because UTF8 is one side or the other of the conversion
>>>>>>> transaction.
>>>>>>>
>>>>>>> The ending diacritic values were: 0xE2, 0xE5, 0xE8, 0xEA, and
>>>>>>> 0xF6.
>>>>>>
>>>>>>
>>>>
>>>> I think what you need to do is "flush", i.e. reset to the
>>>> "initial state". The flush would take place after a field or
>>>> subfield ends.
>>>>
>>>> That's done by iconv and, hopefully, yaz_iconv by setting inbuf or
>>>> *inbuf to NULL, but outbuf to non-NULL, i.e.
>>>>
>>>> yaz_iconv(cd, 0, 0, &outbuf, &outbytesleft);
>>>>
>>>> From 'man 3 iconv':
>>>> "
>>>> A different case is when inbuf is NULL or *inbuf is NULL, but
>>>> outbuf is not NULL and *outbuf is not NULL. In this case, the
>>>> iconv() function attempts to set cd's conversion state to the
>>>> initial state and store a corresponding shift sequence at *outbuf.
>>>> At most *outbytesleft bytes, starting at *outbuf, will be written.
>>>> If the output buffer has no more room for this reset sequence, it
>>>> sets errno to E2BIG and returns (size_t)(-1). Otherwise it
>>>> increments *outbuf and decrements *outbytesleft by the number of
>>>> bytes written.
>>>> "
>>>>
>>>> Use YAZ 2.1.48 or later for this to work.
>>>>
>>>> / Adam
>>>>
>>>>>>>
>>>>>> Did you use yaz-marcdump for the conversion?
>>>>>>
>>>>>> Or did you do something else (such as programming towards the
>>>>>> siconv interface)?
>>>>>>
>>>>>> / Adam
>>>>>>
>>>>>>> Thanks
>>>>>>> Gary
>>>>>>>
_______________________________________________
Yazlist mailing list
Yazlist lists.indexdata.dk
http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist

Re: Fields ending in combining diacritics | Denmark | 2007-03-12 13:36:24
Gary Anderson wrote:
> Adam,
>
> The specific case I am dealing with is a 670 field containing only one
> subfield a that ends with text like: houses built above
> 50<degree mark><field mark>. Obviously, the providing library has used
> the wrong character for the degree mark. They used 0xea, which is a
> combining Angstrom diacritic, when they should have used 0xc0, the
> degree mark.
>
> The initial records are in MARC8 encoding. When I run the translation
> for this, I end up with no errors, but the diacritic character now
> follows the field mark. What I am interested in is a way for the
> siconv library to catch this situation, since applying a diacritic to
> a control character should not be allowed behavior.
>
> In this vein, maybe you can enlighten me on a question: I am running a
> tag by tag conversion on records that I process. Originally, I was not
> including the field mark character in the string sent to siconv. I
> found, however, that there were some cases where the conversion state
> was left indeterminate, so I began to include the field mark in the
> input string. That seems to have fixed all of the other problems
> except this one. What is your recommended practice for converting
> records? Should I be including the field mark or not?
>
You should not include the field mark. IMHO, the MARC control chars are
out of the scope of iconv. Firstly, they should not be converted.
Secondly, they may do harm in case some character set has a special
meaning for them.

And as I said, you'll be getting an error. iconv will now say
"incomplete converted sequence" and you'll have no chars left in the
subfield. So you know something is fishy.

This has been implemented in yaz-iconv, in case you need sample code to
look at. For this, check out YAZ via CVS.

/ Adam
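
The error reporting described in this reply follows the iconv convention
quoted earlier in the thread: input that ends inside an incomplete
sequence yields EINVAL, and a call with a NULL input buffer resets the
conversion state. The sketch below illustrates that pattern with plain
POSIX iconv, using UTF-8 to ISO-8859-1 as a stand-in because stock iconv
implementations generally have no MARC-8 converter. The helper name
convert_chunk is hypothetical, and whether the yaz_iconv calls behave
identically is an assumption not verified here.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts in[0..in_len) into out and stores the
   number of bytes written in *out_len. Returns 0 on success, otherwise
   the errno value reported by iconv_open() or iconv(). */
int convert_chunk(const char *tocode, const char *fromcode,
                  const char *in, size_t in_len,
                  char *out, size_t out_cap, size_t *out_len)
{
    assert(in != NULL && out != NULL && out_len != NULL);

    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return errno;

    char *inbuf = (char *)in;   /* iconv takes char **, input is not modified */
    size_t inleft = in_len;
    char *outbuf = out;
    size_t outleft = out_cap;
    int err = 0;

    /* EINVAL here means the input ended inside a multibyte sequence;
       EILSEQ and E2BIG are the other documented failures. */
    if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1)
        err = errno;
    /* Flush: NULL input asks iconv to return to the initial state. */
    else if (iconv(cd, NULL, NULL, &outbuf, &outleft) == (size_t)-1)
        err = errno;

    *out_len = out_cap - outleft;
    iconv_close(cd);
    return err;
}
```

A caller converting one field or subfield at a time, without the field
mark, could treat a return value of EINVAL as the signal that the data
ends in an incomplete sequence and log the record for review.
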