|
List Info
Thread: Error in latest SICONV.C
|
|
| Error in latest SICONV.C |
  United States |
2007-03-21 20:05:51 |
Adam,
I've run through the latest siconv.c from your cvs
repository. It now
has additional problems with converting UTF8 to MARC8. I
ran a
simplified test. The input string was (hex bytes, ignore
spaces):
61 E8 87 BA E7 81 A3 E5 BE 97
The output string was:
61 1B 24 31 21 54 2B 21 49 43
Notice that only the first two hangul triples were converted
and
output. The ending ESC string was also not output. This
happened
because the normal flow of the logic does NOT call your
flush function.
In fact in siconv.c version $Id: siconv.c,v 1.37 2007/03/20
21:37:32
adam Exp $, yaz_iconv exits at the point shown below.
while (1)
{
unsigned long x;
size_t no_read;
if (cd->unget_x)
{
x = cd->unget_x;
no_read = cd->no_read_x;
}
else
{
if (*inbytesleft == 0)
{
r = *inbuf - inbuf0;
break; <-------------------This is the
exit point.
}
I think what you really meant to do was this:
while (1)
{
unsigned long x;
size_t no_read;
if (cd->unget_x)
{
x = cd->unget_x;
no_read = cd->no_read_x;
}
else
{
if (*inbytesleft == 0)
{
if (cd->flush_handle) // Proposed fix
to correctly
clear the buffers
r = (*cd->flush_handle)(cd, outbuf,
outbytesleft);
// Proposed fix to correctly clear the buffers.
r = *inbuf - inbuf0;
break; <-------------------This is the
exit point.
}
Anyway, I tried the code with this fix, and it works
properly. Is this
the correct fix, or should this problem be fixed another
way?
Gary
_______________________________________________
Yazlist mailing list
Yazlist lists.indexdata.dk
http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yaz
list
|
|
|
| Re: Error in latest SICONV.C |
  Denmark |
2007-03-22 03:04:54 |
Gary Anderson wrote:
> Adam,
>
> I've run through the latest siconv.c from your cvs
repository. It now
> has additional problems with converting UTF8 to MARC8.
I ran a
> simplified test. The input string was (hex bytes,
ignore spaces):
>
> 61 E8 87 BA E7 81 A3 E5 BE 97
>
> The output string was:
>
> 61 1B 24 31 21 54 2B 21 49 43
That might very well be with your test code, which I
unfortunately don't
know anything about.
But I think I can guess what goes on: you did not flush the
output
buffers with yaz_iconv(cd, 0,0, ..) .
Why is _separate_ flushing a good idea? What can't we call
iconv once?
Suppose we did not have it (implicit flushing for every call
of iconv)
and we want to convert a file. We read chunks from disk (say
2K at a
time). We call iconv. iconv flushes every time. But when it
flushes it
assumes that the input is complete. Sometimes it ain't.
There will be
input sequences "crossing" block boundaries. And
so we'll end up with
errors. So we just read the whole in memory? No? Can mmap do
it?
The opposite is much more appealing . We read chunks of
almost size we
want. We don't care about the input. We don't *know* the
input. So we
call iconv for each chunk . It's a streaming process. At EOF
we flush.
Of course, if it's a small and *complete* buffer in memory
it might be
ackward.. We'd have two calls instead of one. But I am sure
you can make
a wrapper yaz_iconv_all(cd, a, b, c, d); which calls
yaz_iconv(cd, a, b,
c, d) + yaz_iconv(cd, 0, 0, c, d);
This is what your patch effectively does , I think.
/ Adam
> Notice that only the first two hangul triples were
converted and
> output. The ending ESC string was also not output.
This happened
> because the normal flow of the logic does NOT call your
flush function.
> In fact in siconv.c version $Id: siconv.c,v 1.37
2007/03/20 21:37:32
> adam Exp $, yaz_iconv exits at the point shown below.
>
> while (1)
> {
> unsigned long x;
> size_t no_read;
>
> if (cd->unget_x)
> {
> x = cd->unget_x;
> no_read = cd->no_read_x;
> }
> else
> {
> if (*inbytesleft == 0)
> {
> r = *inbuf - inbuf0;
> break; <-------------------This is
the exit point.
> }
>
> I think what you really meant to do was this:
> while (1)
> {
> unsigned long x;
> size_t no_read;
>
> if (cd->unget_x)
> {
> x = cd->unget_x;
> no_read = cd->no_read_x;
> }
> else
> {
> if (*inbytesleft == 0)
> {
> if (cd->flush_handle) // Proposed
fix to correctly
> clear the buffers
> r = (*cd->flush_handle)(cd,
outbuf, outbytesleft); //
> Proposed fix to correctly clear the buffers.
> r = *inbuf - inbuf0;
> break; <-------------------This is
the exit point.
> }
>
> Anyway, I tried the code with this fix, and it works
properly. Is this
> the correct fix, or should this problem be fixed
another way?
>
> Gary
>
> _______________________________________________
> Yazlist mailing list
> Yazlist lists.indexdata.dk
> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yaz
list
_______________________________________________
Yazlist mailing list
Yazlist lists.indexdata.dk
http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yaz
list
|
|
| Re: Error in latest SICONV.C |
  United States |
2007-03-22 12:37:19 |
This is a different philosophy that I picked up from your
earlier
emails. Those indicated that I shouldn't be submitting
subfield marks
and field marks to the converter. The MARC-21 standard
specifies that
fixed field and subfield strings are self-contained and ESC
sequences in
MARC8 encoding don't cross field or subfield boundaries.
That is why I
suggested the fix the way I did. Now you seem to be saying
that we can
and should run the data in whatever size chunks we want
until an EOF is
reached. So, rather than relying on the library to give me
back
self-contained data in conversions to and from MARC8, I have
two
choices. First, I can append a period to the end of every
string I
convert and then kill it when the string comes back (that
works, by the
way) or I can make a convert call then another convert call
to flush the
buffers. That seems a little more cumbersome and time
consuming.
Gary
Adam Dickmeiss wrote:
> Gary Anderson wrote:
>
>> Adam,
>>
>> I've run through the latest siconv.c from your cvs
repository. It
>> now has additional problems with converting UTF8 to
MARC8. I ran a
>> simplified test. The input string was (hex bytes,
ignore spaces):
>>
>> 61 E8 87 BA E7 81 A3 E5 BE 97
>>
>> The output string was:
>>
>> 61 1B 24 31 21 54 2B 21 49 43
>
>
> That might very well be with your test code, which I
unfortunately
> don't know anything about.
>
> But I think I can guess what goes on: you did not flush
the output
> buffers with yaz_iconv(cd, 0,0, ..) .
>
> Why is _separate_ flushing a good idea? What can't we
call iconv once?
>
> Suppose we did not have it (implicit flushing for every
call of iconv)
> and we want to convert a file. We read chunks from disk
(say 2K at a
> time). We call iconv. iconv flushes every time. But
when it flushes it
> assumes that the input is complete. Sometimes it ain't.
There will be
> input sequences "crossing" block boundaries.
And so we'll end up with
> errors. So we just read the whole in memory? No? Can
mmap do it?
>
> The opposite is much more appealing . We read chunks of
almost size we
> want. We don't care about the input. We don't *know*
the input. So we
> call iconv for each chunk . It's a streaming process.
At EOF we flush.
>
> Of course, if it's a small and *complete* buffer in
memory it might be
> ackward.. We'd have two calls instead of one. But I am
sure you can
> make a wrapper yaz_iconv_all(cd, a, b, c, d); which
calls
> yaz_iconv(cd, a, b, c, d) + yaz_iconv(cd, 0, 0, c,
d);
> This is what your patch effectively does , I think.
>
> / Adam
>
>> Notice that only the first two hangul triples were
converted and
>> output. The ending ESC string was also not output.
This happened
>> because the normal flow of the logic does NOT call
your flush
>> function. In fact in siconv.c version $Id:
siconv.c,v 1.37
>> 2007/03/20 21:37:32 adam Exp $, yaz_iconv exits at
the point shown
>> below.
>>
>> while (1)
>> {
>> unsigned long x;
>> size_t no_read;
>>
>> if (cd->unget_x)
>> {
>> x = cd->unget_x;
>> no_read = cd->no_read_x;
>> }
>> else
>> {
>> if (*inbytesleft == 0)
>> {
>> r = *inbuf - inbuf0;
>> break;
<-------------------This is the exit point.
>> }
>>
>> I think what you really meant to do was this:
>> while (1)
>> {
>> unsigned long x;
>> size_t no_read;
>>
>> if (cd->unget_x)
>> {
>> x = cd->unget_x;
>> no_read = cd->no_read_x;
>> }
>> else
>> {
>> if (*inbytesleft == 0)
>> {
>> if (cd->flush_handle) //
Proposed fix to correctly
>> clear the buffers
>> r = (*cd->flush_handle)(cd,
outbuf, outbytesleft);
>> // Proposed fix to correctly clear the buffers.
>> r = *inbuf - inbuf0;
>> break;
<-------------------This is the exit point.
>> }
>>
>> Anyway, I tried the code with this fix, and it
works properly. Is
>> this the correct fix, or should this problem be
fixed another way?
>>
>> Gary
>>
>> _______________________________________________
>> Yazlist mailing list
>> Yazlist lists.indexdata.dk
>> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yaz
list
>
>
>
> _______________________________________________
> Yazlist mailing list
> Yazlist lists.indexdata.dk
> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yaz
list
>
_______________________________________________
Yazlist mailing list
Yazlist lists.indexdata.dk
http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yaz
list
|
|
|
| Re: Error in latest SICONV.C |
  Denmark |
2007-03-22 13:10:45 |
Gary Anderson wrote:
> This is a different philosophy that I picked up from
your earlier
> emails. Those indicated that I shouldn't be submitting
subfield marks
> and field marks to the converter. The MARC-21 standard
specifies that
> fixed field and subfield strings are self-contained and
ESC sequences in
> MARC8 encoding don't cross field or subfield
boundaries. That is why I
> suggested the fix the way I did. Now you seem to be
saying that we can
> and should run the data in whatever size chunks we want
until an EOF is
EOF would be end of data field too.
> reached. So, rather than relying on the library to
give me back
> self-contained data in conversions to and from MARC8, I
have two
> choices. First, I can append a period to the end of
every string I
> convert and then kill it when the string comes back
(that works, by the
> way) or I can make a convert call then another convert
call to flush the
> buffers. That seems a little more cumbersome and time
consuming.
If two calls to iconv bothers you, make a wrapper or just
change the
source of YAZ.
/ Adam
>
> Gary
>
> Adam Dickmeiss wrote:
>
>> Gary Anderson wrote:
>>
>>> Adam,
>>>
>>> I've run through the latest siconv.c from your
cvs repository. It
>>> now has additional problems with converting
UTF8 to MARC8. I ran a
>>> simplified test. The input string was (hex
bytes, ignore spaces):
>>>
>>> 61 E8 87 BA E7 81 A3 E5 BE 97
>>>
>>> The output string was:
>>>
>>> 61 1B 24 31 21 54 2B 21 49 43
>>
>>
>> That might very well be with your test code, which
I unfortunately
>> don't know anything about.
>>
>> But I think I can guess what goes on: you did not
flush the output
>> buffers with yaz_iconv(cd, 0,0, ..) .
>>
>> Why is _separate_ flushing a good idea? What can't
we call iconv once?
>>
>> Suppose we did not have it (implicit flushing for
every call of iconv)
>> and we want to convert a file. We read chunks from
disk (say 2K at a
>> time). We call iconv. iconv flushes every time. But
when it flushes it
>> assumes that the input is complete. Sometimes it
ain't. There will be
>> input sequences "crossing" block
boundaries. And so we'll end up with
>> errors. So we just read the whole in memory? No?
Can mmap do it?
>>
>> The opposite is much more appealing . We read
chunks of almost size we
>> want. We don't care about the input. We don't
*know* the input. So we
>> call iconv for each chunk . It's a streaming
process. At EOF we flush.
>>
>> Of course, if it's a small and *complete* buffer in
memory it might be
>> ackward.. We'd have two calls instead of one. But I
am sure you can
>> make a wrapper yaz_iconv_all(cd, a, b, c, d); which
calls
>> yaz_iconv(cd, a, b, c, d) + yaz_iconv(cd, 0, 0,
c, d);
>> This is what your patch effectively does , I
think.
>>
>> / Adam
>>
>>> Notice that only the first two hangul triples
were converted and
>>> output. The ending ESC string was also not
output. This happened
>>> because the normal flow of the logic does NOT
call your flush
>>> function. In fact in siconv.c version $Id:
siconv.c,v 1.37
>>> 2007/03/20 21:37:32 adam Exp $, yaz_iconv exits
at the point shown
>>> below.
>>>
>>> while (1)
>>> {
>>> unsigned long x;
>>> size_t no_read;
>>>
>>> if (cd->unget_x)
>>> {
>>> x = cd->unget_x;
>>> no_read = cd->no_read_x;
>>> }
>>> else
>>> {
>>> if (*inbytesleft == 0)
>>> {
>>> r = *inbuf - inbuf0;
>>> break;
<-------------------This is the exit point.
>>> }
>>>
>>> I think what you really meant to do was this:
>>> while (1)
>>> {
>>> unsigned long x;
>>> size_t no_read;
>>>
>>> if (cd->unget_x)
>>> {
>>> x = cd->unget_x;
>>> no_read = cd->no_read_x;
>>> }
>>> else
>>> {
>>> if (*inbytesleft == 0)
>>> {
>>> if (cd->flush_handle) //
Proposed fix to correctly
>>> clear the buffers
>>> r =
(*cd->flush_handle)(cd, outbuf, outbytesleft);
>>> // Proposed fix to correctly clear the
buffers.
>>> r = *inbuf - inbuf0;
>>> break;
<-------------------This is the exit point.
>>> }
>>>
>>> Anyway, I tried the code with this fix, and it
works properly. Is
>>> this the correct fix, or should this problem be
fixed another way?
>>>
>>> Gary
>>>
>>>
_______________________________________________
>>> Yazlist mailing list
>>> Yazlist lists.indexdata.dk
>>> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yaz
list
>>
>>
>>
>> _______________________________________________
>> Yazlist mailing list
>> Yazlist lists.indexdata.dk
>> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yaz
list
>>
>
> _______________________________________________
> Yazlist mailing list
> Yazlist lists.indexdata.dk
> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yaz
list
_______________________________________________
Yazlist mailing list
Yazlist lists.indexdata.dk
http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yaz
list
|
|
[1-4]
|
|