Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 18:03:35
Marc Lehmann wrote 2007-03-31 0:20 (+0200):
> > > If perl had the abstract model juerd dreams of
> > and uses in day-to-day coding, without encountering ANY of the problems
> > that you describe
> Frankly, that is not a very good sign. It means either you are extremely
> lucky or you don't use any of the many XS modules that silently break, or
> even the Perl modules (such as the example from Compress::Zlib) that break
> less silently, but more miraculously.
Most of the time, it's a question of realising that the module doesn't
do the Perl unicode model, and treating communication with the module as
I/O, i.e. only feed it bytes, and only get bytes back. Encode and decode
as appropriate.

I maintain a short list of some modules at http://juerd.nl/perluniadvice.
If you encounter modules that I can test easily without setting up
complete environments, please let me know!

Compress::Zlib sounds like it uses zlib, which compresses byte streams.
I.e. don't give it unicode strings, because unicode strings have no
bytes (the bytes are internal only, and you don't know what encoding is
used there). Encode explicitly.
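
A minimal sketch of the "encode explicitly" advice, assuming the compress() and uncompress() functions exported by Compress::Zlib and the encode()/decode() functions from Encode. The sample string and variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Compress::Zlib qw(compress uncompress);

# Hypothetical text string; it contains a character above 255.
my $text = "Gr\x{fc}\x{df}e \x{263a}";

# Text -> bytes before handing the data to the byte-oriented module.
my $bytes  = encode('UTF-8', $text);
my $packed = compress($bytes);

# Bytes come back out; decode them to get a text string again.
my $unpacked = uncompress($packed);
my $restored = decode('UTF-8', $unpacked);

print $restored eq $text ? "round trip ok\n" : "round trip failed\n";
```

The module only ever sees byte strings here; the choice of UTF-8 as the encoding is an assumption of the sketch, not something zlib requires.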
> And they do so for some of my other modules doing that, too. And there are
> two options to me: either tell them perl is broken w.r.t. e.g. "C", or
> their code is broken because they do not call downgrade.
Their code is probably broken because they mix text strings with byte
strings. This can be solved most easily by explicitly encoding your text
string as soon as you feel you must join it with a byte string. The
joined string is a byte string. Decoding it to make a text string may or
may not make sense, depending on the data format.
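
As a rough illustration of joining text with bytes, here is a sketch that encodes the text part first. The header bytes and the choice of UTF-8 are assumptions made only for the example:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $header = "\x01\x02\xFF";        # byte string (made-up binary header)
my $text   = "caf\x{e9} \x{20ac}";  # text string, 6 characters

# Encode the text first; joining two byte strings gives a byte string.
my $packet = $header . encode('UTF-8', $text);

printf "%d bytes\n", length $packet;
```

Every element of $packet is now a value in 0..255, so it can be written out or handed to a byte-oriented module as it is.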
> I find "text strings" and "byte strings" not adequate either, as Perl
> makes no difference between those two concepts (being typeless)
Indeed. Programmers have to track this themselves. Sometimes that sucks,
but in my experience, you need to know what kind of data your variable
contains anyway.

If you ++ a reference, you're in for trouble too. How come that's never
been a problem? Probably because programmers are pretty good at knowing
what functions their variables have.

It's just that this is something you haven't needed to know before, so
you're not /trained/ yet to think about it. But you can't go from 256
characters to several thousands without changing the way you think.
> they do not map well to encoded/decoded text either
Oh, but they do. Please read perlunitut, which tries to redefine the
universe into four important definitions (and succeeds).
1. Byte strings (aka binary strings)
2. Text strings (aka unicode strings or "internal format" strings)
3. Decoding is byte --> text
4. Encoding is text --> byte
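
The four definitions can be sketched in a few lines, assuming the decode() and encode() functions from the Encode module. The sample bytes are invented for the example:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# 1. A byte string: six bytes, the UTF-8 encoding of "naive" with a
#    diaeresis on the i.
my $bytes = "na\xC3\xAFve";

# 3. Decoding is byte --> text.
# 2. The result is a text string of five characters.
my $text = decode('UTF-8', $bytes);

# 4. Encoding is text --> byte.
my $again = encode('UTF-8', $text);

printf "%d bytes, %d characters\n", length $bytes, length $text;
```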
> Perl only knows how to concatenate characters, it does not know
> anything about byte or text, so utf8::encode does not necessarily
> create a byte string out of a text string.
I don't get the causal connection you're illustrating.

utf8::encode takes any text string (or unicode string, if you prefer
that term) and turns it into a UTF-8 encoded byte string in place.
That is,
    utf8::encode($foo);
is the efficient equivalent of:
    $foo = encode("utf8", $foo);
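
A small sketch of that equivalence, assuming encode() is imported from Encode. The sample string is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $foo = "\x{263A} smile";         # text string, 7 characters

my $copy = encode('utf8', $foo);    # returns a new byte string

utf8::encode($foo);                 # converts $foo itself, in place

print $copy eq $foo ? "same bytes\n" : "different\n";
```

Both routes end with the same nine bytes; the in-place form simply avoids building a second string.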
Note that whenever a string has an encoding attached to it, conceptually,
it's automatically a byte string. Text strings don't have encodings,
because encodings are a byte thing, and text strings don't have bytes;
they have characters. (Text strings have encodings and bytes
/internally/, just like numbers do have bytes /internally/, encoded in
one way or another, that allows values greater than 255 or less than 0.)
> It could just as well create a text string out of a byte string (think
> JSON, which creates json _text_ out of e.g. byte strings by encoding
> them to UTF-8).
utf8::encode is a text operation. It will assume that whatever you give
it, is a text string. Its characters are considered Unicode codepoints.
You shouldn't give it a byte string.

To understand what happens if you do give utf8::encode a byte string,
you need to know some internals. But I stress that this is not required
knowledge, because it's so much easier to just remember not to do this
weird thing. Why would you try to encode a byte string to UTF-8, anyway?
That makes no sense, because UTF-8 is a means of representing
characters. Byte strings consist of bytes, not characters.

    Here's what happens internally: Any byte string used as a text
    string is considered to be encoded in latin1, because Perl doesn't
    know the difference.
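
To make the latin1 interpretation concrete, here is a tiny sketch; the single sample byte is arbitrary:

```perl
use strict;
use warnings;

my $bytes = "\xE9";        # one byte, meant as binary data
utf8::encode($bytes);      # 0xE9 is read as the character U+00E9

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord($_) } split //, $bytes), "\n";
```

The one input byte comes out as the two-byte UTF-8 sequence for U+00E9, which is rarely what someone holding binary data intended.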
> (or my programs either). It might be a good and simplified advice to a
> beginner
The theory is very simple, but not simplified. It just isn't any harder.
I'm sorry if you want a more complex programming tool. But apparently
you have found ways to make it hard for yourself already.
> though, although I prefer to never tell people simplified (but wrong)
> things.
I agree. Whenever I use a simplified view, that will be obvious or
mentioned. Metadata ("this information is wrong, but useful anyway") is
very important.
> The perl unicode model is rather simple, but leaves you in control,
> and I found teaching people about how perl just allows more than
> 0..255 for a character index works best (although people differ).
That's a great explanation of how unicode strings work. But when people
write programs, these programs typically accept input and also have some
output. And then you're doing I/O, which is done with bytes, and
requires character encodings in order to communicate characters. You
used to be able to ignore this fact when everyone still used iso-8859-1,
I mean CP437, I mean CP850, I mean koi8-r, I mean Windows-1252. Right,
we never did all use exactly the same encoding. We've just chosen to
remain ignorant all this time. Explicit re-encoding, or decoding and
encoding has been necessary all this time. It's just that with more than
256 codepoints, it became much more apparent.
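
One way to picture I/O as a byte boundary is an encoding layer on a filehandle. This sketch assumes PerlIO's :encoding layer and uses an in-memory file in place of a real one; the sample text is made up:

```perl
use strict;
use warnings;

my $text   = "price: \x{20AC}5";   # text string, 9 characters
my $buffer = '';                   # stands in for a file on disk

# Output: characters go in, UTF-8 bytes land in the "file".
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# Input: bytes are decoded back into characters by the layer.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $read_back = <$in>;
close $in;

printf "%d characters, %d bytes stored\n", length $text, length $buffer;
```

The layer does the same encode and decode steps described above; it only moves them to the edge of the program.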
--
kind regards,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales convolution.nl>
I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
2007-03-30 18:33:55
On Sat, Mar 31, 2007 at 01:03:35AM +0200, Juerd Waalboer <juerd convolution.nl> wrote:
> I maintain a short list of some modules at http://juerd.nl/perluniadvice.
> If you encounter modules that I can test easily without setting up
> complete environments, please let me know!
>
> Compress::Zlib sounds like it uses zlib, which compresses byte streams.
> i.e. don't give it unicode strings, because unicode strings have no
> bytes (the bytes are internal only, but you don't know what encoding is
> used there). Encode explicitly.
The difference between us, and that's what it boils down to, is that you
give the internal UTF-X bit meaning. You equate UTF-X flag set == Unicode
string.

To me, a unicode string is a concept outside of perl. I would consider
any text string using the unicode codepoints a unicode string. For
example: "hallo" is a unicode string. And any binary string is not a
unicode string.

The problem with your approach is that you have to expose the UTF-X flag
to users. Which comes with a lot of problems.

Please note that in the actual problem, nobody is passing unicode to
Compress::Zlib. Instead, a binary string is passed to Compress::Zlib that
happens to be UTF-X encoded internally because it was transferred using a
protocol that encodes bytes as UTF-8 (namely JSON), and the decoder opted
not to make another copy of the data for speed reasons.

Compress::Zlib is not buggy. Neither is the caller. The bug is that unpack
treats the same string differently depending on an internal flag that
might be set for a variety of reasons outside the programmer's control.

Initially I thought you, too, wanted a unicode model where the UTF-X bit
is not exposed to the perl level. But in fact the opposite is true: you
force knowledge of the UTF-X bit on users, even though it should be
transparent.

That's the problem. As long as you call UTF-X-encoded strings Unicode
strings and something else byte strings and try to give them meaning,
the programmer has to know about it, as functions behave semantically
differently depending on that flag.

All I want is a perl that behaves semantically consistently, regardless
of some internal flag that is documented not to be of concern to a Perl
programmer.
> Their code is probably broken because they mix text strings with byte
> strings. This can be solved most easily by explicitly encoding your text
> string as soon as you feel you must join it with a byte string. The
> joined string is a byte string. Decoding it to make a text string may or
> may not make sense, depending on the data format.
    use Encode;
    my $bytestring = "zlib-encoded string";
    my $transfer = Encode::encode_utf8($bytestring);
    my $bytes = Encode::decode_utf8($transfer);

$bytes is the same string, but depending on implementation details of
Perl, it is treated differently in different contexts: sometimes it is
treated like the binary string it is, sometimes it is treated as if it
were utf-8 encoded, which it isn't, as I decoded it.
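
A runnable variation on that round trip, assuming encode_utf8() and decode_utf8() from Encode and using all 256 byte values as stand-in binary data. It only checks that the characters survive; whether the internal flag ends up set may depend on the Encode version, so the sketch merely reports it:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Every possible byte value, standing in for arbitrary binary data.
my $bytestring = join '', map { chr($_) } 0 .. 255;

my $transfer = encode_utf8($bytestring);   # octets as sent over the wire
my $bytes    = decode_utf8($transfer);     # 256 characters again

print "same characters: ", ($bytes eq $bytestring ? "yes" : "no"), "\n";
print "internal flag set: ", (utf8::is_utf8($bytes) ? "yes" : "no"), "\n";
```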
> > I find "text strings" and "byte strings" not adequate either, as Perl
> > makes no difference between those two concepts (being typeless)
>
> Indeed. Programmers have to track this themselves. Sometimes that sucks,
> but in my experience, you need to know what kind of data your variable
> contains anyway.
The problem is you want them to track the UTF-X flag in addition to that.
Because putting a "byte string" into unpack should not work if that bit
happens to be set. So you force people who want to use unpack to learn
about that flag, when it is set, when they have to downgrade etc. etc.
> If you ++ a reference, you're in for trouble too. How come that's never
> been a problem?
Because perl treats it consistently.
> It's just that this is something you haven't needed to know before, so
> you're not /trained/ yet to think about it. But you can't go from 256
> characters to several thousands without changing the way you think
Yes. That's not a problem, I understand unicode quite well, and I
understand quite well how Perl stores unicode.

The problem is that I separate internal encoding (unicode can be
encoded both in UTF-X as well as in octets, as can byte strings) from the
unicode model in Perl, while you mix them together, forcing the user to
know the UTF-X bits on their scalars in addition to tracking whether they
are binary or not.
> > they do not map well to encoded/decoded text either
>
> Oh, but they do. Please read perlunitut, which tries to redefine the
> universe into four important definitions (and succeeds).
I do not have that manpage.
> 1. Byte strings (aka binary strings)
>
> 2. Text strings (aka unicode strings or "internal format" strings)
>
> 3. Decoding is byte --> text
>
> 4. Encoding is text --> byte
That doesn't reflect reality, of course, if it were so.
However, those four definitions, as I said, do not map well to
encoded/decoded text. Because "internal format" strings can store binary
data just as well, and often do.

I am talking purely about the perl level strings. If perlunitut confused
the issue by talking about internal encoding it completely failed its
mission, imho.
> I don't get the causal connection you're illustrating.
>
> utf8::encode takes any text string (or unicode string, if you prefer
> that term) and turns it into a UTF-8 encoded byte string in place.
No. It converts characters to UTF-X encoded octets. Whether my characters
are bytes or not is of no consequence.
> Note that whenever a string has an encoding attached to it, conceptually,
> it's automatically a byte string.
Yes. And that encoding is completely independent of the internal UTF-X
flag. Or should be, but isn't, in current perls.
> Text strings don't have encodings,
> because encodings are a byte thing, and text strings don't have bytes;
> they have characters. (Text strings have encodings and bytes
Perl doesn't know about that. It only knows about characters. The problem
is that some parts of perl make a difference between the very same string,
depending on how it is encoded internally, _even if the encoding is the
same on the Perl level_.
> /internally/, just like numbers do have bytes /internally/, encoded in
> one way or another, that allows values greater than 255 or less than 0.)
Exactly. But nothing in perl forces those indices to be unicode characters.
Certainly not the indices 0..255. Yet still, the UTF-X flag might be set or
cleared, resulting in changes in interpretation.

I want those to go away and make perl treat my binary data as binary data,
regardless of how the interpreter treats them.
> utf8::encode is a text operation. It will assume that whatever you give
> it, is a text string. Its characters are considered Unicode codepoints.
Where does it say so?
> You shouldn't give it a byte string.
Please leave it up to me what I should or should not do. This whole
discussion of what I should or should not do is completely beside the
point.

The point is that Perl treats my strings the same in utf8::encode,
regardless of how the UTF-X flag is set, because upgrading or downgrading
does not change the semantics of my characters.
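
A short sketch of that claim, using only the built-in utf8::upgrade and utf8::encode functions. The three sample characters are arbitrary values in 0..255:

```perl
use strict;
use warnings;

my $plain    = join '', map { chr($_) } 0x41, 0xE9, 0xFF;
my $upgraded = $plain;
utf8::upgrade($upgraded);   # changes only the internal representation

# Same characters on the Perl level, whatever the internal flag says.
my $same_before = $plain eq $upgraded;

utf8::encode($plain);
utf8::encode($upgraded);

print "equal before encode: ", ($same_before ? "yes" : "no"), "\n";
print "equal after encode:  ", ($plain eq $upgraded ? "yes" : "no"), "\n";
```

The sketch says nothing about unpack or XS modules, which are the cases under dispute here.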
But in unpack, it does. That's the problem. Not what I should or should
not do. The problem is that giving unpack a binary string makes it return
garbage sometimes (if the binary string happens to be encoded internally
in UTF-X).

This whole "force the user to track the UTF-X bit" is useless. If you
really want that, then go back to 5.005_5x, which forces you to track
your UTF-8 on your own. The whole point of the big change in 5.6 was that
programmers should not care about how perl internally encodes stuff, and I
certainly do not want to give this up. That's what makes perl so good.
> To understand what happens if you do give utf8::encode a byte string,
A byte string is a string containing only octets, that is, values between
0 and 255.

Without knowing any internals, utf8::encode will encode it into a UTF-8
encoded sequence.
> you need to know some internals.
Wrong. I need know no internals, the result is always well-defined: put
characters into utf8::encode, and get utf-8-encoded characters. No need for
internals knowledge, regardless of whether my characters are 0..255 or some
of them happen to be larger. Perl doesn't care, nor does UTF-8 care, nor do
I care.

The problem is, perl cares in unpack, and when handing strings over to XS
modules.
> That makes no sense, because UTF-8 is a means of representing
> characters. Byte strings consist of bytes, not characters.
Not in C, which is what the documentation constantly refers to, mind
you. And no, a byte always has been a character. It is the very definition
of byte in C, regardless of how many bits it has. And the same is true in
perl: a single byte is represented by a single character, having an index
no higher than 255.
> > (or my programs either). It might be a good and simplified advice to a
> > beginner
>
> The theory is very simple, but not simplified. It just isn't any harder.
It doesn't map to reality.
> I'm sorry if you want a more complex programming tool. But apparently
> you have found ways to make it hard for yourself already
Just stop your ad-hominem, please. I told you before that I find it
rather easy, but users of my module find it rather hard, for example. I
worked around a lot of bugs in 5.6 easily, and can slap an occasional
utf8::up/downgrade into my code. But I think it's simply wrong to force
every programmer to know as much about the internals as I do.
> > The perl unicode model is rather simple, but leaves you in control,
> > and I found teaching people about how perl just allows more than
> > 0..255 for a character index works best (although people differ).
>
> That's a great explanation of how unicode strings work.
You think so? Then why do you want to force people to know about how
128..255 is encoded internally then? Because you do when you say that UTF-X
always means text (which is not true in reality, mind you), and you want
unpack to fail on binary strings that happen to be UTF-X encoded?
> we never did all use exactly the same encoding. We've just chosen to
> remain ignorant all this time. Explicit re-encoding, or decoding and
> encoding has been necessary all this time. It's just that with more than
> 256 codepoints, it became much more apparent
Right. But at least when dealing with decoded stuff (such as binary data),
Perl should behave consistently and correctly, but it doesn't.
--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ pcg goof.com
--==---/ / _ / // / / / http://schmorp.de/
-=====/_/_//_/_,_/ /_/_ XX11-RIPE