Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 18:27:14
Marc Lehmann wrote on 2007-03-31 0:41 (+0200):
> The reason I wanna know is because I want to know what to tell
> people. Either it is "your code is broken, unpack "C" without downgrade
> is a bug in your code" or "it is a bug in perl, you can work around by
> enabling ->shrink for the time being".

If a downgrade is "needed", it means that your byte string was
accidentally upgraded. This should only happen if you mix it with a text
string. If it happens without mixing it with a text string, that is a
bug. Please report.

So, neither "your code is broken, unpack "C" without downgrade is a bug
in your code" nor "it is a bug in perl".

Instead: "your code is broken, don't mix text strings with byte strings"
or "it is a bug in perl that your string got upgraded in the first
place."
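
To make the "mixing" case concrete, here is a minimal sketch of how a byte
string ends up upgraded once it is concatenated with a text string. The
variable names are illustrative only, and utf8::is_utf8 is used purely to
peek at the internal flag, not as something ordinary code should depend on.

```perl
use strict;
use warnings;

# A byte string: one octet with value 0xFC, stored without the UTF-8 flag.
my $bytes = "\xfc";

# A text string holding a character above 255 (U+263A), which Perl can
# only store in its internal UTF-8 form.
my $text = "\x{263a}";

# Concatenating the two forces the result into the upgraded form.
my $mixed = $bytes . $text;

# Take the first character back out. It compares equal to the original,
# but it now carries the UTF-8 flag internally.
my $head = substr($mixed, 0, 1);

printf "bytes flagged: %d\n", utf8::is_utf8($bytes) ? 1 : 0;
printf "head flagged:  %d\n", utf8::is_utf8($head)  ? 1 : 0;
printf "head eq bytes: %d\n", $head eq $bytes       ? 1 : 0;
```

The string value is the same before and after; only the internal
representation changed, which is the "accidental upgrade" meant above.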

> Exactly. But "C" somehow works on UTF-8, while it shouldn't.

Agreed!

Things that specifically handle bytes, and bytes only, should DIE (or at
least warn) when used with a string that has the UTF-8 flag on. This
still lets users get away with naively assuming that byte == character
for latin1 strings, as designed, but at least catches the cases when you
know that the user does something stupid.

> It should work on characters, as documented (just like in C, char
> array[]; array[i] is one character, regardless of how many bits a
> character in C has, or how it is encoded).

A C "char" is a byte, not a multibyte character, ever.

Besides that, the "C" in Perl's pack() is documented as a single byte.

I think that "char value" should be either removed from perlfunc, or
explained in more detail. It's NOT OBVIOUS to those who don't know C.

> > * The chr and ord functions work on characters
> >   chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000
> >   In other words, chr and ord are like pack("U") and unpack("U"), not like
> >   pack("C") and unpack("C"). In fact, the latter two are how you now emulate
> >   byte-orientated chr and ord if you're too lazy to use bytes.
>
> So due to that documentation insanity it is now suggested that all code that
> used "C" before must use "U" now to get the same effect as in earlier perl
> versions?

The earlier Perl versions didn't support character values greater than
255, and if you never have those characters, C still works perfectly.

But yes, if you're dealing with characters and want your program to be
able to handle those fancy new >255 characters, you should change that C
to a U.
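
As a rough illustration of the chr/ord versus pack("U")/unpack("U")
equivalence quoted above, the following sketch builds a text string from
code points and reads them back both ways. The variable names are
illustrative only.

```perl
use strict;
use warnings;

# Build a text string from code points, two of them above 255.
my @code_points = (1, 20, 300, 4000);
my $text = join '', map { chr } @code_points;

# ord() per character and unpack("U*") both recover the code points.
my @via_ord = map { ord } split //, $text;
my @via_U   = unpack("U*", $text);

# pack("U*") builds the same string that chr() did.
my $packed = pack("U*", @code_points);

print "ord:    @via_ord\n";
print "unpack: @via_U\n";
print "pack U matches chr: ", ($packed eq $text ? "yes" : "no"), "\n";
```

Both lists come back as 1 20 300 4000, and the packed string compares
equal to the one built with chr.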

> Besides, perl 5.8 does not follow that description:
>
>    perl -e '$x = "\xc3\xbc"; die unpack "U*", $x'
>
> This gives me 195188, two characters, although it is a single UTF-8
> character, so why does it wrongly give me two? $x certainly is utf-8-encoded
> (try Encode::encode_utf8 chr 252, it results in the above string).

You asked for the codepoints U+00C3 and U+00BC, and got them.

It's a UTF-8 encoded byte string, alright, but "U" is for Unicode, not
UTF-8.
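
The distinction can be sketched like this: the same two octets give two
code points when unpacked as they are, and one code point once they have
been decoded from UTF-8 into a text string. This is a small sketch using
the Encode module; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Two octets that happen to be the UTF-8 encoding of U+00FC.
my $octets = "\xc3\xbc";

# Unpacked as-is, each octet is treated as its own code point.
my @raw = unpack("U*", $octets);

# Decoded first, the two octets become one character.
my $text    = decode_utf8($octets);
my @decoded = unpack("U*", $text);

print "raw:     @raw\n";
print "decoded: @decoded\n";
print "round trip ok\n" if encode_utf8($text) eq $octets;
```

The undecoded string yields 195 and 188, the decoded one yields 252.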

> Ok, so I will tell people to replace "C" by "U" in their code then.

If they do Unicode text strings, that's indeed very good advice.

But you still want C for byte strings, simply because some protocols or
formats expect a byte value.

> Right, while the documentation on unpack "U" disagrees with it, as it talks
> about UTF-8.

That would be a bug, but I can't find it in my copy (5.8.8). It only
says "Encodes to UTF-8 internally" for pack(), which as far as I can
tell, is true.
--
korajn salutojn,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales convolution.nl>
Ik vertrouw stemcomputers niet.
Zie <http://www.wijvertrouwenstemcomputersniet.nl/>.

2007-03-30 18:53:48
On Sat, Mar 31, 2007 at 01:27:14AM +0200, Juerd Waalboer <juerd convolution.nl> wrote:
> If a downgrade is "needed", it means that your byte string was
> accidentally upgraded. This should only happen if you mix it with a text
> string. If it happens without mixing it with a text string, that is a
> bug. Please report.

That's extremely far from reality. Lots of things can cause a text string
to be upgraded. Forcing people to learn all that is just stupid when you
could just make it work logically without telling people about internals
(note that the internals come into play by your peculiar definition of
"text strings" having the UTF-X bit set, which isn't reality and in my
opinion is an extremely stupid limitation that 96% of perl does not
follow).

> Instead: "your code is broken, don't mix text strings with byte strings"
> or "it is a bug in perl that your string got upgraded in the first
> place."

See my json example. Nothing gets mixed.

> > Exactly. But "C" somehow works on UTF-8, while it shouldn't.
>
> Agreed!
>
> Things that specifically handle bytes, and bytes only, should DIE (or at
> least warn) when used with a string that has the UTF-8 flag on.

So you force people to know about the internal flag, or else they cannot
avoid the die.

This completely contradicts your claim that you want to abstract the UTF-X
flag away from the Perl level.

> still lets users get away with naively assuming that byte == character
> for latin1 strings, as designed, but at least catches the cases when you
> know that the user does something stupid.

But the user does not do anything stupid when feeding binary strings (my
definition, indices 0..255) into Compress::Zlib. It is only your request
for a die that makes problems. Zlib would work just fine if perl gave
downgraded data to perl and XS code that wants it.

> > It should work on characters, as documented (just like in C, char
> > array[]; array[i] is one character, regardless of how many bits a
> > character in C has, or how it is encoded).
>
> A C "char" is a byte, not a multibyte character, ever.

Exactly. The same as in Perl I would assume, as Perl uses characters to
store bytes, it doesn't use multibyte characters on the Perl level.

Hope you get it this time.

> Besides that, the "C" in Perl's pack() is documented as a single byte.

"A C "char" is a byte".

Your words.

But here you say a byte is not a character. That's a contradiction.

You are deeply confusing the internal encoding Perl uses (which might be
single octets for characters, or UTF-X encoded octets for characters)
with the language proper.

In C, a single byte is a character, even if it happens to have a value
higher than 255 (although very few compilers allow that; usually, a byte
is an octet, although it is common on DSPs to have 32 bit bytes).

Even if Perl encoded a single character into multiple C bytes/octets, that
does not mean it's more than a single character.

The documentation is completely contradictory when it comes to "C" and can
easily be interpreted to mean a single character in the C sense.

Fact is "even under Unicode" it doesn't work as advertised, because Unicode
can be internally represented in multiple ways in Perl.

> I think that "char value" should be either removed from perlfunc, or
> explained in more detail. It's NOT OBVIOUS to those who don't know C.

To those who do know C it has perfectly clear meaning, namely a single
character.

> The earlier Perl versions didn't support character values greater than
> 255, and if you never have those characters, C still works perfectly.

Nothing in C limits you to 256 characters. A byte in C is exactly a
character. It can store at least 256 different values, but nothing in C
limits you to that, many compilers use larger bytes. And the same is true
in Perl: Perl only supported bytes 0..255 in earlier versions, and now the
perl byte can be up to 64 bits (or maybe a bit less, I forgot).

> But yes, if you're dealing with characters and want your program to be
> able to handle those fancy new >255 characters, you should change that C
> to a U.

I do not want to handle those fancy >255 characters. I only want to handle
a single octet. But unpack doesn't do that.

In fact, that's the problem: all old code that uses unpack "C" would need
to be changed to use "U". That's the compatibility breakage I was talking
about. Code that uses "C" expects the single-octet meaning from perl
5.005, it does not expect the "sometimes returns half of a utf-x encoded
character, sometimes not" meaning it has in current perls.

It is especially weird as it suddenly has become incompatible with regards
to the other template characters such as "n", which correctly decode
bytes regardless of internal encoding.

> > Besides, perl 5.8 does not follow that description:
> >
> >    perl -e '$x = "\xc3\xbc"; die unpack "U*", $x'
> >
> > This gives me 195188, two characters, although it is a single UTF-8
> > character, so why does it wrongly give me two? $x certainly is utf-8-encoded
> > (try Encode::encode_utf8 chr 252, it results in the above string).
>
> You asked for the codepoints U+00C3 and U+00BC, and got them.

No, I asked for UTF-8 encoded characters. Again, read the documentation:

   * If the pattern begins with a "U", the resulting string will
   * be treated as UTF-8-encoded Unicode.

that's for pack, unfortunately.

   U   A Unicode character number. Encodes to UTF-8 internally

uh, that internal thing again. So how many characters will pack "U", 200
give me? According to the documentation, 2, as UTF-8 requires that. That
is not what happens, though.
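
For reference, the pack "U", 200 case can be checked with a short sketch
like the one below. It counts the result once as Perl-level characters and
once as explicitly encoded UTF-8 octets; the variable names are
illustrative only.

```perl
use strict;
use warnings;

# One Unicode character, number 200 (U+00C8).
my $packed = pack("U", 200);

# At the Perl level this is a single character.
my $char_count = length($packed);

# To see the encoded size, turn a copy into explicit UTF-8 octets.
my $octets = $packed;
utf8::encode($octets);
my $octet_count = length($octets);

print "characters: $char_count\n";
print "octets when encoded as UTF-8: $octet_count\n";
```

This reports one character and two octets, so the "2" only shows up once
the string is explicitly encoded.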

That's the problem. Perfectly working code using unpack "CN" suddenly
stops working because "N" works on bytes, while "C" works on the internal
encoding, regardless of what that might be.
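
A minimal sketch of the downgrade workaround discussed in this thread,
assuming the record really contains only values 0..255. What unpack "C"
returns on an upgraded string has differed between Perl releases, so the
sketch does not rely on that and downgrades first. The record layout and
variable names are hypothetical.

```perl
use strict;
use warnings;

# A small binary record: one octet followed by a 32-bit big-endian integer.
my $record = pack("CN", 252, 258);

# Simulate an accidental upgrade. The string value is unchanged; only the
# internal representation differs.
my $upgraded = $record;
utf8::upgrade($upgraded);
my $same_value = ($upgraded eq $record) ? 1 : 0;

# Workaround: force the single-octet representation before unpacking.
# utf8::downgrade dies if the string holds a character above 255.
utf8::downgrade($upgraded);
my ($tag, $number) = unpack("CN", $upgraded);

print "same value after upgrade: $same_value\n";
print "tag=$tag number=$number\n";
```

After the downgrade the record unpacks to 252 and 258 again, independent
of how the string was stored in between.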

> It's a UTF-8 encoded byte string, alright, but "U" is for Unicode, not
> UTF-8.

You can store unicode in UTF-8. If you say "UTF-8 encoded unicode" then you
very well have UTF-8, even though it still is unicode.

> > Ok, so I will tell people to replace "C" by "U" in their code then.
>
> If they do Unicode text strings, that's indeed very good advice.

Unfortunately, that's what they have to do when dealing with binary
strings, as C doesn't work on them.

> But you still want C for byte strings, simply because some protocols or
> formats expect a byte value.

Exactly. And then I have to use "U" to get it. Because a byte in perl is a
character. Is and always has been, just as in C.

And to get those bytes for use in such protocols you have to use "U" now,
instead of "C" as in earlier versions.

> > Right, while the documentation on unpack "U" disagrees with it, as it talks
> > about UTF-8.
>
> That would be a bug, but I can't find it in my copy (5.8.8). It only
> says "Encodes to UTF-8 internally" for pack(), which as far as I can
> tell, is true.

So it talks about using UTF-8, so, according to you, it is a bug. Fine
with me.
--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ pcg goof.com
--==---/ / _ / // / / / http://schmorp.de/
-=====/_/_//_/_,_/ /_/_ XX11-RIPE