Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 14:53:52 |
Marc Lehmann wrote 2007-03-30 14:02 (+0200):
> > The *conceptual* purpose of the UTF8 flag isn't there. Conceptually,
> > every string can be a unicode string, and you're not supposed to look
> > at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK
> > and NOK. [1]
> That's not how current perl works.

We must have differing definitions, somewhere.

> > Perl conceptually has a single numeric type, and a single string type.
> > The distinction between integer and float, and between iso-8859-1 and
> > utf-8, is internal.
> I would love if that were the case, but the powers that be decided that
> every perl programmer has to know those internals, and needs to be able
> to deal with them.

The best approach to programming with unicode in mind, in Perl, is to
(pretend to) be completely ignorant about Perl's internals with regards
to encoding and the UTF8 flag.

The only exception is the regex engine, which has a big bug. This can be
worked around, again without any knowledge of the internals, by
utf8::upgrade'ing both sides of the regex before trying the match.
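
A minimal sketch of the workaround just described, not taken from the thread. U+00E9 is an arbitrary sample character and `$pattern` and `$subject` are hypothetical names. It assumes a perl where no locale, no `unicode_strings` feature and no `/u` modifier is in effect; under those conditions the same character may or may not match `\w` depending only on how it is stored, and upgrading both operands makes the answer independent of storage.

```perl
use strict;
use warnings;

# The same one-character string, stored two ways.
my $native   = "\xe9";          # one byte internally, UTF8 flag off
my $upgraded = "\xe9";
utf8::upgrade($upgraded);       # UTF-8 internally, UTF8 flag on

# Perl's string model treats the two as equal...
print "equal\n" if $native eq $upgraded;

# ...but the regex engine may answer differently for each of them.
printf "native:   %s\n", ($native   =~ /\w/ ? "matches" : "no match");
printf "upgraded: %s\n", ($upgraded =~ /\w/ ? "matches" : "no match");

# Workaround: upgrade both the pattern and the subject before matching.
my $pattern = "\\w";
my $subject = $native;
utf8::upgrade($pattern);
utf8::upgrade($subject);
my $matched = $subject =~ /$pattern/ ? 1 : 0;
print "after upgrading both sides: $matched\n";
```

Neither upgrade changes what the strings mean; they only change the storage, which is the one thing the regex engine should not have been looking at.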

Your powers-that-be might be different. Also, don't confuse "you can
know what Perl does internally" with "you have to know what Perl does
internally".

Just being able to access internal metadata doesn't mean you should
actually do so on a daily basis.

It's entirely possible to make undef writable, and have it equal 42.
No-one is complaining about that, and only very few people ever get the
idea of changing the value of undef.

It's also entirely possible to set the internal flag "UTF8" on an
existing string. But for some reason a lot of people are complaining
about that, and even more people have actually set UTF8 flags
themselves...

> > Note that Perl internally uses iso-8859-1 (8 bit) and utf-8 (variable
> > whole-octet), not ascii (7 bit).
> No, Perl exposes this. For example, see the recent example of
> Compress::Zlib:
>     unpack ('CCCCVCC', $$string);
> that code is broken because the powers that be decided that "C" exposes
> the internal encoding, while "V" doesn't.

Yes, any byte-specific operation on a text string (which I keep separate
from character strings) will use the internal encoding. It has to use
/some/ encoding, because it cannot see whether the string was meant as a
byte string or a text string. Perl does not have strong typing.

Personally, I think that unpack with a byte-specific signature should
die, or at least warn, when its operand has the UTF8 flag set. That'll
catch at least some of the cases, because the UTF8 flag always
positively indicates that the string is a text string. (The reverse,
however, is not true: a string without the UTF8 flag might be either a
text string or a byte string.)
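
To make the proposal in the paragraph above concrete, here is a rough user-level approximation. `strict_unpack` is a hypothetical helper, not an existing function; core `unpack` performs no such check. Note that the helper necessarily looks at the internal flag via `utf8::is_utf8`.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: behaves like unpack, but refuses operands whose
# UTF8 flag is set.
sub strict_unpack {
    my ($template, $data) = @_;
    croak "strict_unpack: operand has the UTF8 flag set"
        if utf8::is_utf8($data);
    return unpack $template, $data;
}

my $bytes = "\x01\x02\xff";               # never upgraded
my @ok = strict_unpack("C*", $bytes);

my $flagged = $bytes;
utf8::upgrade($flagged);                  # same characters, flag now set
my $died = eval { strict_unpack("C*", $flagged); 1 } ? 0 : 1;

print "octets: @ok\n";
print "flagged operand rejected: $died\n";
```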

> That requires every perl programmer who decodes file headers etc.
> using unpack to know about those internals.

No, it requires every Perl programmer to keep track of the function of
every string.

Byte strings and text strings must never be combined, and text strings
must never undergo byte-specific operations.

This again requires no knowledge of the actual encoding that Perl uses
internally, whatsoever.

> The same is true for many XS modules: in older versions of perl, SvPV
> gave you the 8-bit version of a scalar, but in current versions, it
> randomly gives you either 8-bit or utf-8 encoded. SvPV was renamed to
> SvPVbyte.

Unfortunately, I lack knowledge of these internals, so I cannot comment
about this (yet).

Note that XS writers must have knowledge of Perl's internals. This has
always been true, and is not specific to this fancy new Unicode thing.

> And the problem is that those bugs are not considered bugs but features.

Some bugs are acknowledged as bugs, but won't be fixed anyway, because
there is already a lot of code in the wild that depends on the bugs.

> > [1] Some parts of Perl break this concept. The regex engine is one of
> > them, and has different semantics depending on the presence of the flag.
> > This is a bug, but any fix would be incompatible.
> In fact, some parts of perl break this concept and make perfectly working
> code (in 5.005) not working anymore, or working randomly, and that's not
> considered a bug.

Personally I'm only interested in 5.8.2 and later, but I still would
like to learn about this history.

> unpack "C", $s;

The C template for unpack is specifically documented as byte-specific.
It should never be used on text strings. If you properly keep text and
byte strings separate, that means that your byte string was never
upgraded, and that unpacking with "C" is reliable and predictable.

If upgrading happened even though the string was not mixed with text
strings or used with unicode semantics, that is a bug. I'm very
interested in these silent upgrades that you are experiencing.

> If you think it is obvious, how about this:
>
>     my $s = chr 255; # to me, this is one octet. to perl, it might be one or
>                      # two, or maybe more, who knows.
>     warn unpack "C", $s;
>     "$sx";
>     warn unpack "C", $s;
>     $s .= "x"; substr $s, 1, 1, "";
>     warn unpack "C", $s;
> Can a pure-Perl programmer tell what the output of this program is
> without trying it?

Not relevant.

> Should he be able to?

No, because the author of this program made a big mistake in the line
"$sx".

The casual reader can easily figure out that $s was meant as a byte
string: it is used with unpack "C", which is known to be a byte
operation. Because it is a byte string, the chr 255 is just a 0xFF
octet, not a ÿ (ÿ) conceptually.

The casual reader can also easily figure out that x is meant as a
text string: any codepoint higher than x is always a character,
never a single byte.

Then, the author of this snippet uses both the byte string $s and the
text string "x" joined in one string "$sx". People not interested in
fixing the code can stop reading there: the code is broken and its
semantics not terribly relevant. People who wish to fix it will have to
try and figure out what the author really wanted to do here.

Because it's a contrived case, that's very hard to figure out. But I'm
sure that given real world values and variable names, there would be a
clear and logical solution, to be found somewhere along the lines of
encoding and decoding explicitly.
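
As an illustration of what "encoding and decoding explicitly" can look like, here is a small sketch using the core Encode module. The sample data and variable names are made up for the example: octets are decoded into a text string at the boundary, text operations happen on the decoded string, and the result is encoded back to octets before it leaves the program.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from a file or socket: "cafe" with an
# acute accent on the e, encoded as UTF-8 (five octets).
my $input_bytes = "caf\xc3\xa9";
my $n_octets    = length $input_bytes;

# Boundary in: octets -> text string (four characters).
my $text = decode('UTF-8', $input_bytes);

# Text operation on the text string.
my $shouted = uc $text;

# Boundary out: text string -> octets.
my $output_bytes = encode('UTF-8', $shouted);

printf "octets in: %d, characters: %d, octets out: %d\n",
    $n_octets, length $text, length $output_bytes;
```

At no point does the code ask how Perl stores `$text` internally; it only tracks which variables hold octets and which hold text.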

> That's a broken unicode model

So far, I've only seen a broken understanding of the unicode model, and
a broken regex engine.
--
Kind regards,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy
<sales convolution.nl>
I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
2007-03-30 16:00:37 |
On Mar 30, 2007, at 12:53 PM, Juerd Waalboer wrote:
> Perl does not have strong typing.

If it is so deadly to collide byte-oriented data with character data,
it should not be so easy to do so accidentally.

>> That's a broken unicode model
>
> So far, I've only seen a broken understanding of the unicode model,
> and a broken regex engine.

That so many users, including those as expert as Marc, possess a
"broken" understanding of Perl's Unicode model suggests a flawed
design. We have been set up to fail.

(My admiration for the Unicode integration effort remains
undiminished by its flaws.)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
2007-03-30 18:04:51 |
On Fri, Mar 30, 2007 at 09:53:52PM +0200, Juerd Waalboer
<juerd convolution.nl> wrote:
> > > at, know, or set the UTF8 flag yourself. It's an internal bit,
> > > like IOK and NOK. [1]
> > That's not how current perl works.
>
> We must have differing definitions, somewhere.

No. I have explained elsewhere that we quite agree on how it should be.
It is just that you make strange claims:

> The best approach to programming with unicode in mind, in Perl, is to
> (pretend to) be completely ignorant about Perl's internals with regards
> to encoding and the UTF8 flag.

It doesn't work even when not having unicode in mind. See unpack.

> The only exception is the regex engine, which has a big bug.

Uhm, no.

> Your powers-that-be might be different. Also, don't confuse "you can
> know what Perl does internally" with "you have to know what Perl does
> internally".

In the example I gave, you have to.

> Just being able to access internal metadata doesn't mean you should
> actually do so on a daily basis.

What's the alternative? Replace all my uses of unpack with explicit calls
to ord? Sorry, but that's completely unrealistic.

> It's also entirely possible to set the internal flag "UTF8" on an
> existing string. But for some reason a lot of people are complaining
> about that, and even more people have actually set UTF8 flags
> themselves...

Yes. Because you have to when interfacing with a gazillion existing
modules (or at the very least clear or downgrade).

If perl didn't force people to know the internals so often, one could
certainly get away with telling them: do not touch downgrade/upgrade, and
certainly never utf8_on or is_utf8, it is from the devil.

But that's far from reality.

> > unpack ('CCCCVCC', $$string);
> > that code is broken because the powers that be decided that "C"
> > exposes the internal encoding, while "V" doesn't.
>
> Yes, any byte-specific operation on a text string (which I keep separate
> from character strings) will use the internal encoding. It has to use
> /some/ encoding, because it cannot see whether the string was meant as a
> byte string or a text string. Perl does not have strong typing.

That's wrong. There is a perfectly good definition for character and byte:
the one from C. It is a single element of a string. The same thing was true
in perl: one byte is one character, and it should be true under the new
model.

Nothing in pack or unpack requires a specific encoding, just as nothing in
perl should require me to know the specific encoding of "chr 200". It is a
single byte/character, regardless of how perl stores it internally.

> Personally, I think that unpack with a byte-specific signature should
> die, or at least warn, when its operand has the UTF8 flag set.

That's pure insanity. Then people would again be forced to know the
internal encoding. How can you tell people to not worry about internal
encoding and in the next paragraph force them to know, because suddenly
they are not allowed to call unpack unless some _internal_ flag has some
specific value?

I severely doubt you understood perl's unicode model: It works by
abstracting away the internal flag completely, not forcing the user to
deal with it. Forcing her to deal with it is *wrong*.

> catch at least some of the cases, because the UTF8 flag always
> positively indicates that the string is a text string.

No, absolutely not. You are confused. The UTF-X flag only marks a specific
encoding used by perl internally. It says nothing about text or not text.
You can store binary just fine in a UTF-X marked string.

> (The reverse,
> however, is not true: a string without the UTF8 flag might be either a
> text string or a byte string.)

As might a string with the UTF-X flag set. Perl is typeless, it doesn't
know anything about text vs. binary.

> > That requires every perl programmer who decodes file headers etc.
> > using unpack to know about those internals.
>
> No, it requires every Perl programmer to keep track of the function of
> every string.

No. A binary string is a binary string because it contains no characters
higher than 255. It is that simple.
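
The definition above can be tested without consulting the internal flag at all. A minimal sketch follows; `is_octet_string` is a hypothetical helper name, and U+0100 is just a sample character above 255.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character of the string fits in
# one octet, i.e. no code point is above 255.
sub is_octet_string {
    my ($s) = @_;
    return $s !~ /[^\x00-\xff]/ ? 1 : 0;
}

my $native   = "\xff\x00";
my $upgraded = $native;
utf8::upgrade($upgraded);           # same characters, different storage
my $wide     = "\xff\x{100}";       # contains a character above 255

printf "native: %d, upgraded: %d, wide: %d\n",
    is_octet_string($native),
    is_octet_string($upgraded),
    is_octet_string($wide);
```

The check looks only at code points, so the native and the upgraded copy are classified the same way.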

> Byte strings and text strings must never be combined, and text strings
> must never undergo byte-specific operations.

That is certainly wrong.

> This again requires no knowledge of the actual encoding that Perl uses
> internally, whatsoever.

It does, for unpack, both in current perl as well as in your proposed
change.

> Note that XS writers must have knowledge of Perl's internals. This has
> always been true, and is not specific to this fancy new Unicode thing.

Right. But why gratuitously break old code? In perl, it is broken by at
least unpack, in XS, it is broken by changing the meaning of SvPV.

> > And the problem is that those bugs are not considered bugs but
> > features.
>
> Some bugs are acknowledged as bugs, but won't be fixed anyway, because
> there is already a lot of code in the wild that depends on the bugs.

Again, I know a lot of code that is currently broken because of that
bug. I asked, but nobody found code "in the wild" that relies on that
specific bug.

> > unpack "C", $s;
>
> The C template for unpack is specifically documented as byte-specific.

No, it is specifically documented as being character-specific. Read your
manpage carefully:

    c   A signed char value.
    C   An unsigned C char (octet) even under Unicode.

(Note that byte and character are the same thing in C.) That leaves us
with "octet". An octet is a number between 0 and 255 (you can give
alternative definitions that are equivalent to mine, though).

In perl this is an octet:

    $x = chr 200;

Yet unpack under some circumstances returns two values for this single
octet, and sometimes not. And the only way to know is to inspect the
internal UTF-X flag.
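
A short sketch of the distinction being argued about, using only core functions. At the character level, `chr 200` is one element with value 200 however it is stored; only functions that deliberately expose the representation, such as `bytes::length`, see a difference. What `unpack "C"` reports for the upgraded copy depends on the perl version, so it is left out here.

```perl
use strict;
use warnings;
use bytes ();   # makes bytes::length available without enabling byte semantics

my $x = chr 200;            # stored as a single byte
my $y = $x;
utf8::upgrade($y);          # same string, stored as UTF-8 internally

my $same_value   = $x eq $y ? 1 : 0;
my $char_length  = length $y;            # counts characters
my $code_point   = ord $y;               # value of the first character
my $stored_bytes = bytes::length($y);    # counts bytes of the representation

print "same value: $same_value\n";
print "length: $char_length, ord: $code_point\n";
print "internal bytes: ", bytes::length($x), " vs $stored_bytes\n";
```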

> It should never be used on text strings.

Perl is typeless. There is no such thing as a text string in Perl. The
problem, however, is not that it doesn't work on "text strings", whatever
that might be, the problem is that unpack doesn't work on binary strings,
or at least not all the time.

> If you properly keep text and byte strings separate, that means that
> your byte string was never upgraded, and that unpacking with "C" is
> reliable and predictable.

Uhhh, who guarantees that? JSON::XS does no such thing, and cannot
guarantee that, because Perl has no type for "text string" vs. "binary
string". So how do you suggest JSON::XS keeps text and byte strings
separate, if there is no way to detect the type of a string or make a
useful difference between those two?

> If upgrading happened even though the string was not mixed with text
> strings or used with unicode semantics, that is a bug. I'm very
> interested in these silent upgrades that you are experiencing.

Concatenating strings might upgrade them (e.g. in debugging output).
Moreover, JSON::XS currently can return either UTF-X encoded strings or
non UTF-X-encoded strings.
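
A minimal sketch of such a silent upgrade through concatenation. U+0100 is an assumed stand-in for "some character above 255"; the thread does not show which character was actually used. After the wide character is removed again, the string has its original value but keeps the upgraded storage.

```perl
use strict;
use warnings;

my $s = chr 255;                            # one character, one byte
my $before = utf8::is_utf8($s) ? 1 : 0;

$s .= chr 0x100;        # appending a character above 255 forces an upgrade
substr $s, 1, 1, "";    # remove it again; the value is back to chr 255

my $after = utf8::is_utf8($s) ? 1 : 0;

printf "flag before: %d, flag after: %d, value unchanged: %s\n",
    $before, $after, ($s eq chr 255 ? "yes" : "no");
```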

You can call that buggy. So please tell me how to fix that bug. How do I,
when decoding a JSON string, know whether it is one of your text or byte
strings? What's the difference, if neither JSON nor Perl makes one?

> > If you think it is obvious, how about this:
> >
> >     my $s = chr 255; # to me, this is one octet. to perl, it might be one or
> >                      # two, or maybe more, who knows.
> >     warn unpack "C", $s;
> >     "$sx";
> >     warn unpack "C", $s;
> >     $s .= "x"; substr $s, 1, 1, "";
> >     warn unpack "C", $s;
> > Can a pure-Perl programmer tell what the output of this program is
> > without trying it?
>
> Not relevant.

Very relevant.

> > Should he be able to?
>
> No, because the author of this program made a big mistake in the line
> "$sx".

Are you sure that upgraded? And why is it a mistake? I very much differ in
that.

> The casual reader can easily figure out that $s was meant as a byte
> string

I cannot, from that short fragment. Neither can Perl.

> it is used with unpack "C", which is known to be a byte
> operation. Because it is a byte string, the chr 255 is just a 0xFF
> octet, not a ÿ (ÿ) conceptually.

Exactly. But unpack does not return 255 for that byte string.

> The casual reader can also easily figure out that x is meant as a
> text string: any codepoint higher than x is always a character,
> never a single byte.

Why? Lots of people use those higher codepoints. Perl certainly does
not mandate anything like that, so why do you try to enforce it? People
routinely do stuff like join "x", png_images to separate them, and
it works fine.

Perl's unicode model does not enforce a meaning of the codepoints used in
strings. It simply allows me to use more character indices than in 5.005.

> Then, the author of this snippet uses both the byte string $s and the
> text string "x" joined in one string "$sx". People not
> interested in fixing the code can stop reading there: the code is broken
> and its semantics not terribly relevant.

Thanks for gratuitously calling my code broken. In any case, explain to me
how to fix it in general, I only gave an example of silent upgrades.

    use JSON::XS;
    my $x = (from_json to_json [$y])[0];

is another silent upgrade users need to know about.
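
For readers who want to try a round trip of this kind, here is a sketch that swaps in the core module JSON::PP and its `encode_json`/`decode_json` functions in place of the JSON::XS calls quoted above; that substitution is an assumption made so the example runs on a stock perl. It prints the flag before and after without asserting a particular value, since that is an implementation detail of the JSON module.

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json decode_json);

my $y    = chr 255;                     # one character, code point 255
my $json = encode_json([$y]);           # UTF-8 encoded JSON text (octets)
my $x    = decode_json($json)->[0];     # the round-tripped value

printf "equal: %s\n", ($x eq $y ? "yes" : "no");
printf "UTF8 flag before: %d, after: %d\n",
    (utf8::is_utf8($y) ? 1 : 0),
    (utf8::is_utf8($x) ? 1 : 0);
```

The round-tripped value compares equal to the original, so any difference shows up only for code that depends on the internal representation.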

> People who wish to fix it, will
> have to try and figure out what the author really wanted to do here.

Exactly that.

> Because it's a contrived case, that's very hard to figure out.

Not at all. You are just guessing, and getting it wrong.

> sure that given real world values and variable names, there would be a
> clear and logical solution, to be found somewhere along the lines of
> encoding and decoding explicitly.

See above, figure it out in the real world then.

> > That's a broken unicode model
>
> So far, I've only seen a broken understanding of the unicode model, and
> a broken regex engine.

Same here. Your model requires people knowing about the UTF-X flag (at
least in unpack). Mine doesn't, and I think mine is much closer to what
you want to achieve: not having to tell people about it. In your model
you would have to tell people to downgrade before unpacking a string, or
alternatively, you rule out a lot of perfectly fine Perl code on the
assumption that it is easy to figure out that it is broken. Sorry, but I
differ very much.
--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ pcg goof.com
--==---/ / _ / // / / / http://schmorp.de/
-=====/_/_//_/_,_/ /_/_ XX11-RIPE
2007-03-30 20:53:25 |
Juerd Waalboer wrote 2007-03-30 21:53 (+0200):
> Personally, I think that unpack with a byte-specific signature should
> die, or at least warn, when its operand has the UTF8 flag set.

Since this post I've changed my mind, and think it should only warn if
there are wide characters after attempting to downgrade first. Just like
the existing "wide character in %s" warning.

juerd lanova:~$ perl -wle'$a = "foo\xff"; utf8::upgrade($a); print $a' | hexdump -C
00000000  66 6f 6f ff 0a                                    |foo..|
00000005
juerd lanova:~$ perl -wle'$a = "foo\x{20ac}"; utf8::upgrade($a); print $a' | hexdump -C
Wide character in print at -e line 1.
00000000  66 6f 6f e2 82 ac 0a                              |foo....|
00000007
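
The revised proposal can be approximated in user code as follows. `unpack_octets` is a hypothetical helper, and the warning text is made up to resemble the existing one; core `unpack` does not behave this way by itself. The helper works on a copy, tries `utf8::downgrade` with its second argument true (so failure returns false instead of dying), and only warns when the data really contains characters above 255.

```perl
use strict;
use warnings;

# Hypothetical helper: downgrade a copy first, warn only on wide characters.
sub unpack_octets {
    my ($template, $data) = @_;     # $data is a copy; the caller's string is untouched
    unless (utf8::downgrade($data, 1)) {
        warn "Wide character in unpack_octets\n";
        return;
    }
    return unpack $template, $data;
}

# Upgraded, but every character fits in an octet: no warning.
my $upgraded = "foo\xff";
utf8::upgrade($upgraded);
my @octets = unpack_octets("C*", $upgraded);

# Contains U+20AC: cannot be downgraded, so the helper warns.
my @warnings;
my @none = do {
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    unpack_octets("C*", "foo\x{20ac}");
};

print "octets: @octets\n";
print "warnings: ", scalar(@warnings), "\n";
```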
--
Kind regards,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy
<sales convolution.nl>
I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.