Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-31 00:55:56
On Sat, Mar 31, 2007 at 03:53:25AM +0200, Juerd Waalboer <juerd convolution.nl> wrote:
> Juerd Waalboer wrote 2007-03-30 21:53 (+0200):
> > Personally, I think that unpack with a byte-specific signature should
> > die, or at least warn, when its operand has the UTF8 flag set.
>
> I've since this post changed my mind, and think it should only warn if
We are making progress, and I would actually be content with that
solution, but it does break "U". The solution, really, is to treat "C" like
an octet in the same way "n" is treated like two octets. That does not
break existing code and is what many perl programmers find natural.
Since so many people are confused about why the unpack change breaks code, I
will explain it differently:
my $k = "\x10\x00";
die unpack "n", $k;
This gives me 4096. "n" is documented to take exactly 16 bits, two octets.
I get 4096 regardless of how perl chooses to represent it internally: if
perl goes to using UCS-4 (something that won't happen for sure, but has
been stated before to remind people that the internal encoding can change), it
would still work.
Same thing for "L", which is documented to be exactly 32 bits.
Now, when people want an 8 bit value followed by a 16 bit big endian value,
they used "Cn" in the old times. In fact, they still use that, as "C"
always has been the octet companion to the 16 bit and 32 bit sSlLnNvV etc.
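
As a minimal sketch of the fixed-width templates described above: "n" consumes two octets, "L" packs to four, and "Cn" reads one octet followed by a 16-bit big-endian value. The three-octet record and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# "n" reads exactly two octets as a big-endian 16-bit value.
my $k = "\x10\x00";
my $n = unpack "n", $k;                  # 0x1000

# "L" always packs to exactly four octets (byte order is the platform's).
my $l_width = length pack "L", 1;

# An 8-bit value followed by a 16-bit big-endian value: "Cn".
my $record = "\x01\x10\x00";             # hypothetical 3-octet record
my ($tag, $value) = unpack "Cn", $record;

print "$n $l_width $tag $value\n";       # prints "4096 4 1 4096"
```
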
However, in a weird stroke, somebody decided that "C" no longer gives
you a single octet of your string, but, depending on internal encoding,
depending on an internal flag, part of that octet or the octet.
Now, what has been unpack "CCV" in perl 5.005 must be written as unpack
"UUV" in perl 5.8, as "U" has the right semantics for decoding a single
octet out of a binary string.
That's weird, because now code that _doesn't_ want to deal with unicode at
all, but in fact only deals with binary data, must use this unicode thingy
"U", even though the documentation for "C" clearly says it's an octet, and
even says it's an octet in C, which is exactly what those people decoding
structures or network packets want.
That is the problem.
Now, I don't mind at all if I get a die when trying "C" on a
byte=character that is >255 (i.e. not representable as an octet). Or a die
when attempting that on a two byte=character string with "n".
I personally dislike the warning, because the warning only ever comes up
when there is a bug. It doesn't matter much to me personally, though.
What matters to me is that binary-only code now needs to use "U" when
formerly "C" was meant, to get correct behaviour.
This *needs* to be fixed.
--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ pcg goof.com
--==---/ / _ / // / / / http://schmorp.de/
-=====/_/_//_/_,_/ /_/_ XX11-RIPE
2007-03-31 04:48:48
Marc Lehmann wrote 2007-03-31 7:55 (+0200):
> > > Personally, I think that unpack with a byte-specific signature should
> > > die, or at least warn, when its operand has the UTF8 flag set.
> > I've since this post changed my mind, and think it should only warn if
> We are making progress, and I would actually be content with that
> solution, but it does break "U".
No, breaking U does not occur, because it's not in my list of
byte-specific (un)pack templates. U is for unicode characters.
> The solution, really, is to treat C like
> an octet in the same way "n" is treated like two octets.
It does that, but we're having a very different understanding of the
word "octet", and my hands hurt, so I'm not going through it all again.
> Since so many people are confused about why the unpack change breaks code, I
> will explain it differently:
> my $k = "\x10\x00";
> die unpack "n", $k;
> this gives me 4096. "n" is documented to take exactly 16 bits, two octets.
juerd lanova:~$ perl -le'print unpack "n", "\x{20ac}"'
57986
"\x{20ac}" is one character, but "n" works on octets, not characters.
This uses the internal buffer without warning, and picks the first two
octets of the three-octet sequence e2 82 ac. This octet sequence should
be hidden from the programmer, but it is too late for that. So instead,
let's warn the programmer that what's going on is very probably not what
they intended.
juerd lanova:~$ perl -le'print unpack "n", "\xe2\x82"'
57986
The annoying thing for people who don't know when Perl upgrades strings,
is when you started with a nice 2-octet byte string, and it got upgraded
somewhere. Here, forced for illustration, and using the same 2-octet
sequence so the difference in results is obvious:
juerd lanova:~$ perl -le'$foo = "\xe2\x82"; utf8::upgrade($foo); print unpack "n", $foo'
50082
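
The numbers in these runs can be reproduced without depending on how a particular perl treats upgraded strings inside unpack, by making the UTF-8 octets explicit with utf8::encode() on a copy. This is only a sketch: the variable names are illustrative, and using utf8::encode() to stand in for the internal buffer is an assumption made for the demonstration.

```perl
use strict;
use warnings;

# U+20AC written out as UTF-8 octets: e2 82 ac.
my $euro_octets = "\x{20ac}";
utf8::encode($euro_octets);
my $euro_hex = unpack "H*", $euro_octets;
my $euro_n   = unpack "n",  $euro_octets;    # first two octets: e2 82

# The plain 2-octet byte string, read as-is.
my $bytes   = "\xe2\x82";
my $plain_n = unpack "n", $bytes;

# The same two values stored as UTF-8, the form an upgraded string holds:
# U+00E2 becomes c3 a2, U+0082 becomes c2 82.
my $upgraded_octets = $bytes;
utf8::encode($upgraded_octets);
my $upgraded_hex = unpack "H*", $upgraded_octets;
my $upgraded_n   = unpack "n",  $upgraded_octets;

print "$euro_hex $euro_n\n";                 # prints "e282ac 57986"
print "$plain_n\n";                          # prints "57986"
print "$upgraded_hex $upgraded_n\n";         # prints "c3a2c282 50082"
```
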
A warning about the wide characters here would be in order and save
people's butts.
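
One way to get this kind of safety net in user code, sketched under assumptions: unpack_octets() below is a hypothetical helper, not something perl provides. It downgrades a copy of the data first, so an upgraded byte string still yields its original octets, and it dies when the string holds a character above 255. Whether such a case should warn or die is exactly what is being debated in this thread; the sketch simply picks die.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of perl: unpack octet data, but refuse
# strings that contain characters above 255.
sub unpack_octets {
    my ($template, $data) = @_;
    my $octets = $data;                       # work on a copy
    utf8::downgrade($octets, 1)               # false on failure, no die
        or die "wide character in octet string\n";
    return unpack $template, $octets;
}

# An upgraded byte string still only holds values <= 255.
my $upgraded = "\xe2\x82";
utf8::upgrade($upgraded);
my ($n) = unpack_octets("n", $upgraded);

# A real wide character is rejected.
my $ok    = eval { unpack_octets("n", "\x{20ac}\x{20ac}"); 1 };
my $error = $ok ? "" : $@;

print "$n\n";                                 # prints "57986"
print $error;                                 # prints the die message
```
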
> I get 4096 regardless of how perl chooses to represent it internally
Because Perl always uses latin1 or utf8 internally, in both of which
\x10 and \x00 are octets 0x10 and 0x00 respectively.
> If perl goes to using UCS-4 (something that won't happen for sure, but
> has been stated before to remind people that internal encoding can
> change), it would still work.
Not as far as I can tell, because Perl uses the raw octets of the
internal encoding whenever you do byte-specific operations, and the
internal encoding for U+0010 and U+0000 changes when you go from UTF-8
to UCS-4.
That's why it's so darn useful to use latin1 when possible, because you
can then be pretty sure that "\x10\x00" will be the two octets you
expect. (Note that breaking this is the main breakage caused by
encoding.pm.)
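
To illustrate how the octets for U+0010 and U+0000 differ between encodings, here is a small sketch using the core Encode module. UTF-32BE is used as a stand-in for UCS-4, which is an assumption of this example; perl itself does not store strings that way.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $k = "\x10\x00";                        # U+0010 U+0000

my $utf8 = encode("UTF-8",    $k);         # octets 10 00
my $ucs4 = encode("UTF-32BE", $k);         # stand-in for UCS-4 (assumption)

my $utf8_hex = unpack "H*", $utf8;
my $ucs4_hex = unpack "H*", $ucs4;

# "n" takes the first two octets of whatever buffer it is handed.
my $utf8_n = unpack "n", $utf8;
my $ucs4_n = unpack "n", $ucs4;

print "$utf8_hex $utf8_n\n";               # prints "1000 4096"
print "$ucs4_hex $ucs4_n\n";               # prints "0000001000000000 0"
```
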
> However, in a weird stroke, somebody decided that "C" no longer gives
> you a single octet of your string, but, depending on internal encoding,
> depending on an internal flag, part of that octet or the octet.
What you call "octet", I call "character". And I'll never call that
"octet" or "byte" because then none of the documentation about all this
would still be right, and Perl would suddenly indeed be broken.
If you insist on calling the value of "\x{20ac}" a single octet, then
indeed pack/unpack will not do what you want, because what you want is
just not how it works.
"\x{20ac}" is one character. Internally, represented by three octets.
The internal representation is used, if you unpack with byte-specific
templates like "C" or "n".
Byte strings, i.e. strings with no character values >255 that have never
been in contact with UTF-8 encoded strings, may be interpreted as latin1
and internally converted to UTF-8 when you join them with text strings.
This causes unpack to see very different values, and that's one of the
reasons one should avoid mixing byte strings and text strings.
Note that my definition of "text string" excludes byte encoded strings,
such as the results of encode() or utf8::encode().
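
A short sketch of the mixing described above: joining a byte string with a text string gives a UTF8-flagged result, and encoding that result shows the original octets e2 82 turned into c3 a2 c2 82. The variable names are illustrative; encode() comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bytes = "\xe2\x82";                    # byte string
my $text  = "\x{20ac}";                    # text string, one character > 255

my $joined = $bytes . $text;               # the byte string gets upgraded

my $bytes_flag  = utf8::is_utf8($bytes)  ? 1 : 0;
my $joined_flag = utf8::is_utf8($joined) ? 1 : 0;
my $chars       = length $joined;          # counted in characters

# encode() yields a byte string again, without the UTF8 flag.
my $encoded      = encode("UTF-8", $joined);
my $encoded_flag = utf8::is_utf8($encoded) ? 1 : 0;
my $encoded_hex  = unpack "H*", $encoded;

print "$bytes_flag $joined_flag $chars\n"; # prints "0 1 3"
print "$encoded_flag $encoded_hex\n";      # prints "0 c3a2c282e282ac"
```
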
> Now, what has been unpack "CCV" in perl 5.005 must be written as unpack
> "UUV" in perl 5.8, as "U" has the right semantics for decoding a single
> octet out of a binary string.
> Thats weird
Weird only because you choose to use a different meaning of the word
"octet" than much of the rest of the world.
> Now, I don't mind at all if I get a die when trying "C" on a
> byte=character that is >255 (i.e. not representable as an octet).
Just so other people know: since Perl has had Unicode support, there has
been a consistent effort to teach people that character != byte, and
that a single character may consist of several bytes.
In fact, this effort has been present in larger parts of computing than
just Perl, but for clarity's sake, I'm sticking to Perl because
sometimes Perl's definitions differ. (For example, in Perl, a character
is a single code point, while in Unicode, a character can be composed
out of several combining code points.)
Also, values greater than 255 do not fit in a single byte, according to
computer science that decided that byte==octet==8 bits. 8 bits
simply hold only 2**8==256 values. Hence the need for a distinction
between bytes, and things that *are* able to hold other values.
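
The distinctions drawn here can be checked with a few lines of Perl. This is only an illustrative sketch with made-up variable names: it shows a character whose value exceeds what one octet can hold, and a sequence that Perl counts as two characters (code points) although a reader sees one.

```perl
use strict;
use warnings;

my $euro      = "\x{20ac}";
my $euro_ord  = ord $euro;                 # a single value above 255
my $byte_vals = 2**8;                      # values one octet can hold

# "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $e_acute     = "e\x{301}";
my $code_points = length $e_acute;         # Perl counts code points
my $clusters    = () = $e_acute =~ /\X/g;  # user-perceived characters

print "$euro_ord $byte_vals $code_points $clusters\n";  # prints "8364 256 2 1"
```
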
> I personally dislike the warning, because the warning only ever comes up
> when there is a bug.
I love warnings that only ever come up when I have a bug. In fact, I
generally dislike warnings that don't follow that pattern.
--
kind regards,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales convolution.nl>
I don't trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.