Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)

2007-03-28 04:12:15
Darren Duncan wrote 2007-03-27 15:52 (-0700):
> I believe that a true utf8 flag should mean that the string contains
> data that is valid utf8, not just that it has utf8 characters outside
> the ASCII range.

How often should Perl check for this? Directly after decoding only, or
also after mutating operations like substr, or s///?

> As far as I know, the conceptual purpose of the utf8 flag is to
> indicate whether Perl considers a string to be unambiguous character
> data or binary data which could be ambiguous character data, and thus
> how Perl will treat it by default.

The *conceptual* purpose of the UTF8 flag isn't there. Conceptually,
every string can be a unicode string, and you're not supposed to look
at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK
and NOK. [1]

>     confess q{Bad arg; Perl 5 does not consider it to be a char str.}
>         if !Encode::is_utf8( $v );

As said, this is not the purpose of the flag, and you're not supposed to
use is_utf8 for this. It is documented with the "[INTERNAL]" flag, for a
good reason.

Perl conceptually has a single numeric type, and a single string type.
The distinction between integer and float, and between iso-8859-1 and
utf-8, is internal.
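
A minimal sketch of what "internal" means here, not taken from the original message. It assumes only the built-in utf8::upgrade and utf8::is_utf8 functions; the variable names are made up for illustration. Two copies of the same string are stored differently, yet compare equal at the Perl level:

```perl
use strict;
use warnings;

# One character, code point 233 (e-acute), stored as a single octet.
my $native = "\xe9";

# The same string, with the internal representation forced to UTF-8.
my $upgraded = $native;
utf8::upgrade($upgraded);

# The internal flag differs between the two copies...
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# ...but as Perl strings they are the same value.
my $same_string = ($native eq $upgraded)                 ? 1 : 0;
my $same_length = (length($native) == length($upgraded)) ? 1 : 0;

print "flags: $flag_native / $flag_upgraded\n";
print "equal: $same_string, same length: $same_length\n";
```

The flag is inspected here only to demonstrate the point; code that branches on it is exactly what the message argues against.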

This could be changed, but will introduce incompatibilities and a severe
loss of performance for strings that fit in iso-8859-1.

What I want (and I think you want too) is a real type system, to have
two different distinct types: byte strings and character strings. It
would be bad to use a flag called "UTF8" for this, because a byte string
can also be UTF8 encoded. Perl already suffers from this problem, but
because the UTF8 flag is *INTERNAL*, it's not a big deal. It would be if
it surfaced and was used by Perl coders.

A whole type system is a bit too much to implement in Perl 5, I think.
Our current unicode string semantics are a great way to deal with not
having types, in my opinion.

> Instead, the older documented utf8 flag behaviour would require this
> unnecessary extra work in order to accept all valid input:

No.

If your subroutine expects text, it can only assume that it gets text,
and it should not (must not?) make any distinction based on the internal
encoding.

The string it gets is a Unicode string. Not a UTF8 string, not a latin1
string.

>         if !Encode::is_utf8( $v ) and $v =~ m/[^\x00-\x7F]/xs;

This check is wrong. If the flag is not set, that means only that the
internal encoding is iso-8859-1 if the string is a text string, not that
the string is a byte string.

The reverse is true, however: if the flag is set, the string will not be
a byte string. But lack of UTF8 flag is no indication of byte versus
character.

> I would expect the use of the regular expression, which would be
> called for any ASCII data

Note that Perl internally uses iso-8859-1 (8 bit) and utf-8 (variable
whole-octet), not ascii (7 bit).

The character é (eacute) may be stored internally as the single octet
233 (decimal) and does not by itself cause an internal upgrade to UTF-8.
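
The storage claim can be checked with a small sketch (an illustration, not part of the original message). It assumes the core bytes module, loaded without importing byte semantics, and uses bytes::length purely as a debugging aid to count internal octets:

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

# chr(233) is e-acute; Perl stores it as one octet by default.
my $e_acute       = chr(233);
my $chars_before  = length($e_acute);          # characters
my $octets_before = bytes::length($e_acute);   # internal octets

# Force the UTF-8 representation on a copy and count again.
my $copy = $e_acute;
utf8::upgrade($copy);
my $chars_after  = length($copy);
my $octets_after = bytes::length($copy);

print "before upgrade: $chars_before char, $octets_before octet(s)\n";
print "after upgrade:  $chars_after char, $octets_after octet(s)\n";
```

The character count stays at one in both cases; only the internal octet count changes.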

[1] Some parts of Perl break this concept. The regex engine is one of
them, and has different semantics depending on the presence of the flag.
This is a bug, but any fix would be incompatible.
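
A hedged sketch of the kind of difference the footnote refers to (not from the original message). It matches the same one-character string against \w in both internal representations. The result depends on the Perl version and on which features or regex modifiers are in effect, so no expected output is claimed here:

```perl
use strict;
use warnings;

# The same string, once stored as a single octet, once stored as UTF-8.
my $native   = "\xe9";
my $upgraded = $native;
utf8::upgrade($upgraded);

# Ask the regex engine whether the character counts as a "word" character.
my $word_native   = ($native   =~ /\w/) ? 1 : 0;
my $word_upgraded = ($upgraded =~ /\w/) ? 1 : 0;

print "strings equal:    ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "\\w on native:     $word_native\n";
print "\\w on upgraded:   $word_upgraded\n";
```

If the two \w lines disagree on a given Perl, that is the flag leaking into regex semantics.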
--
Kind regards,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales convolution.nl>
I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.

2007-03-30 07:02:32
On Wed, Mar 28, 2007 at 11:12:15AM +0200, Juerd Waalboer <juerd convolution.nl> wrote:
> > As far as I know, the conceptual purpose of the utf8 flag is to
> > indicate whether Perl considers a string to be unambiguous character
> > data or binary data which could be ambiguous character data, and thus
> > how Perl will treat it by default.
>
> The *conceptual* purpose of the UTF8 flag isn't there. Conceptually,
> every string can be a unicode string, and you're not supposed to look
> at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK
> and NOK. [1]

That's not how current perl works.

> Perl conceptually has a single numeric type, and a single string type.
> The distinction between integer and float, and between iso-8859-1 and
> utf-8, is internal.

I would love it if that were the case, but the powers that be decided
that every perl programmer has to know those internals, and needs to be
able to deal with them.

> Note that Perl internally uses iso-8859-1 (8 bit) and utf-8 (variable
> whole-octet), not ascii (7 bit).

No, Perl exposes this. For example, see the recent example of
Compress::Zlib:

    unpack ('CCCCVCC', $$string);

That code is broken because the powers that be decided that "C" exposes
the internal encoding, while "V" doesn't. That requires every perl
programmer who decodes file headers etc. using unpack to know about
those internals.

This is especially bad as not only has the meaning of "C" been shifted
from decoding bytes to something else (instead of using a new modifier),
but no alternative has been provided to get the old meaning of "C", so
basically all code that doesn't utf8::downgrade is broken now by this
change in meaning.

(Worse is the fact that it's wrongly documented to decode an octet even
in the presence of Unicode, but it doesn't decode an octet, unless you
define "octet" in Perl to mean that "\xa0" is either one or two octets.)

The same is true for many XS modules: in older versions of perl, SvPV
gave you the 8-bit version of a scalar, but in current versions, it
randomly gives you either 8-bit or utf-8 encoded. SvPV was, in effect,
renamed to SvPVbyte.

Both of those gratuitously backwards-incompatible changes break lots of
existing code.

And the problem is that those bugs are not considered bugs but features.

> [1] Some parts of Perl break this concept. The regex engine is one of
> them, and has different semantics depending on the presence of the flag.
> This is a bug, but any fix would be incompatible.

In fact, some parts of perl break this concept and make perfectly working
code (in 5.005) not working anymore, or working randomly, and that's not
considered a bug.

I wonder why it is ok to break large amounts of perl and xs code silently,
without even documenting how to fix it[1], while at the same time 5.10
introduced "use feature" to shield against possible breakage with far
less of an impact than the changes above.

[1] If it is documented, then anybody please show me why this:

    utf8::downgrade $s;
    unpack "C", $s;

is documented to have different effects from:

    unpack "C", $s;

i.e., where is it documented that perl doesn't upgrade the scalar in
between those lines? If you think it is obvious, how about this:

    my $s = chr 255; # to me, this is one octet. to perl, it might be one or
                     # two, or maybe more, who knows.
    warn unpack "C", $s;
    "$sx";
    warn unpack "C", $s;
    $s .= "x"; substr $s, 1, 1, "";
    warn unpack "C", $s;

Can a pure-Perl programmer tell what the output of this program is without
trying it? Should he be able to? I would say the answer is no to both.
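
For readers who want a result that does not depend on the internal state, here is a minimal sketch of the defensive utf8::downgrade idiom mentioned above (an illustration, not code from the thread). The upgrade is simulated explicitly with utf8::upgrade. What a bare unpack "C" returns on an upgraded string depends on the Perl version: as far as I know, Perl 5.10 and later changed pack and unpack to operate on characters by default.

```perl
use strict;
use warnings;

my $s = chr 255;
utf8::upgrade($s);    # simulate an "accidental" internal upgrade

# utf8::downgrade converts the string back to one octet per character,
# in place. With a true second argument it returns false instead of
# dying when the string holds code points above 255.
my $ok = utf8::downgrade($s, 1);

# After a successful downgrade, "C" sees exactly one octet per character.
my ($octet) = unpack "C", $s;

print "downgrade ok: ", ($ok ? 1 : 0), ", first octet: $octet\n";
```

Checking the return value matters: if the downgrade fails, the data was never a byte string to begin with.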

It is beyond me how people can introduce so much breakage to existing code
so lightly, forcing many modules to be changed and forcing pure-Perl
programmers to understand the perl interpreter sources to get their
unicode right.

That's a broken unicode model, and as long as those kinds of bugs are
considered features, perl programmers very well have to care about that
internal utf-x, utf-8, whatever flag.
--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ pcg goof.com
--==---/ / _ / // / / / http://schmorp.de/
-=====/_/_//_/_,_/ /_/_ XX11-RIPE

2007-03-30 19:13:29
Marc Lehmann wrote 2007-03-31 1:53 (+0200):
> So you force people to know about the internal flag, lest they cannot
> avoid the die.

No, you don't have to know about the UTF8 flag, just that Perl can't
always know if your string is a text string, but is there to help you
when it does.

> > Besides that, the "C" in Perl's pack() is documented as a single byte.
> "A C "char" is a byte".
> Your words.
> But here you say a byte is not a character. That's a contradiction.

"C char" ne "Perl character".

> No, I asked for UTF-8 encoded characters. Again, read the documentation:
>     * If the pattern begins with a "U", the resulting string will
>     * be treated as UTF-8-encoded Unicode.

Resulting string, not input string.

The word "internally" is missing here. I will do my best to correct
that.

> that's for pack, unfortunately.
>     U   A Unicode character number. Encodes to UTF-8
>         internally
> uh, that internal thing again. So how many characters will pack "U", 200
> give me? According to the documentation, 2, as UTF-8 requires that.

One character. Note again that "character" isn't the same as a "C char".
We in Perl land, and the people over in Unicode land, use different
words, sometimes.

Most of the time, a Perl "character" means codepoint.

> > > Right, while the documentation on unpack "U" disagrees with it, as
> > > it talks about UTF-8.
> > That would be a bug, but I can't find it in my copy (5.8.8). It only
> > says "Encodes to UTF-8 internally" for pack(), which as far as I can
> > tell, is true.
> So it talks about using UTF-8, so, according to you, it is a bug. Fine
> with me.

This was for pack, you were talking about unpack. Also, the word
"internally" was probably not added without reason.
--
Kind regards,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales convolution.nl>
I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.

2007-03-30 20:03:21
Marc Lehmann wrote 2007-03-31 2:42 (+0200):
> Repeating wrong statements does not make them true.

I'll refrain from the obvious response.

> No, please look at my example of JSON.

JSON is pretty big to just quickly examine. I have nothing set up for
testing it.

> > I'm constantly very explicitly and verbosely telling people to NOT look
> > at the flag, NOT set it manually, etcetera.
> So why do you propose that people have to make sure that they never put a
> binary string with the UTF-X flag set into unpack?

Not unpack in general, but unpack "C".

Because "C" is explicitly catered for byte data, which strings with the
UTF8 flag aren't. It won't always catch mistakes, because indeed lack of
the flag says nothing, but it can help catch some of them.

Perl already has a similar warning in many places, for example when you
print such a "wide character" on a filehandle that has no encoding or
utf8 layer. Some modules, like MIME::Base64, provide the same
functionality.

> How are users supposed to do that, unless they know about the flag in the
> first place?

By keeping byte strings and text strings separate. Please either accept
this, or stop asking me questions that will lead to this answer.

> Right, and then you want perl functions to die depending on the setting of
> that flag, even though you also claim Perl users should not need to know
> about it.

The warning would not be a new feature, but an existing feature applied
in more places. "die" is probably too harsh indeed.

> So you tell users when they get that error message that they did something
> wrong that they should not care about?

When they get the error message, they can read the following in
perldiag:

    Wide character in %s
        (W utf8) Perl met a wide character (>255) when it wasn't
        expecting one. This warning is by default on for I/O (like
        print). The easiest way to quiet this warning is simply to add
        the ":utf8" layer to the output, e.g. "binmode STDOUT, ':utf8'".
        Another way to turn off the warning is to add "no warnings
        'utf8';" but that is often closer to cheating. In general, you
        are supposed to explicitly mark the filehandle with an encoding,
        see open and "binmode" in perlfunc.

Changing the order of these sentences is on my to-do list.

Note how this clear explanation doesn't mention the UTF8 flag!
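
The warning described by perldiag can be reproduced with a short sketch (an illustration, not part of the original message). It assumes File::Spec->devnull as a throwaway output target and captures warnings through a $SIG{__WARN__} handler; the variable names are hypothetical. The third step shows one way to turn the warning into an exception for people who prefer that, using a scoped "use warnings FATAL => 'utf8'":

```perl
use strict;
use warnings;
use File::Spec;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $devnull = File::Spec->devnull;

# 1. No encoding layer: Perl warns, as perldiag describes.
open my $raw, '>:raw', $devnull or die "open: $!";
print {$raw} chr(1000);
close $raw;
my $warned_without_layer = scalar @warnings;
my $first_warning        = $warnings[0];

# 2. Explicit encoding layer: nothing to warn about.
@warnings = ();
open my $enc, '>:encoding(UTF-8)', $devnull or die "open: $!";
print {$enc} chr(1000);
close $enc;
my $warned_with_layer = scalar @warnings;

# 3. Promote the warning to an exception within a limited scope.
my $died = 0;
{
    use warnings FATAL => 'utf8';
    open my $strict, '>:raw', $devnull or die "open: $!";
    $died = 1 unless eval { print {$strict} chr(1000); 1 };
    close $strict;
}

print "without layer: $warned_without_layer warning(s)\n";
print "with layer:    $warned_with_layer warning(s)\n";
print "fatal scope died: $died\n";
```

None of the three steps looks at the UTF8 flag; the program only decides which filehandles carry text and marks them with an encoding.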

> As I said, I have no such manpage

See bleadperl or Google.

> You want perl functions to behave different depending on whether that
> flag is set or not. I want perl functions to behave the same,
> regardless of the fact.

I want Perl to warn about certain mistakes when it can.

> > That's not what I said, nor what I meant. In fact, quite the opposite.
> So then unpack should not croak when it sees the UTF-X flag?

No, it should warn instead. From now on, I no longer think it should die.
It should warn, and people who want it to die can do so with
"use warnings FATAL".

> > > The problem is that some parts of perl make a difference between the
> > > very same string, depending on how it is encoded internally, _even if
> > > the encoding is the same on the Perl level_.
> > Those are bugs. Report them, and they might get fixed.
> I did. That's the whole point of this thread. I reported them a number of
> times. How could you miss that?

I don't usually read bug reports, and never claimed to have done so.

But in this special case, I will make an exception, and read the Unicode
related bug reports that you have submitted.
--
Kind regards,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales convolution.nl>
I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.

2007-03-31 05:08:30
On Sat, Mar 31, 2007 at 01:53:48AM +0200, Marc Lehmann wrote:
> In C, a single byte is a character, even if it happens to have a value
> higher than 255 (although very few compilers allow that, usually, a byte
> is an octet, although it is common on DSPs to have 32 bit bytes).
>
> Even if Perl encoded a single character into multiple C bytes/octets, that
> does not mean it's more than a single character.
>
> The documentation is completely contradictory when it comes to "C" and can
> easily be interpreted to mean a single character in the C sense.
>
> Fact is "even under Unicode" it doesn't work as advertised, because Unicode
> can be internally represented in multiple ways in Perl.
>
> > I think that "char value" should be either removed from perlfunc, or
> > explained in more detail. It's NOT OBVIOUS to those who don't know C.
>
> To those who do know C it has perfectly clear meaning, namely a single
> character.

http://www.parashift.com/c++-faq-lite/intrinsic-types.html#faq-26.3

But that is not really relevant to the discussion.

Communication is difficult if you cannot express clearly what you are
trying to say. Terminology is important to get correct, and it is easy
to confuse others or yourself if you are not precise when you need to
be.

Unicode does not even HAVE characters, it has codepoints. This did not
happen by accident and is an important distinction to make.

    $x = "ABCD";
    $x = "\x41\x42\x43\x44";
    $x = chr(65) . chr(66) . chr(67) . chr(68);
    $x = pack("C*", 65, 66, 67, 68);

All of these put the same data into $x. [1] We can reasonably assume
that $x contains a sequence of 4 bytes, each 8 bits wide. We do not
know anything about what $x is, if it has an encoding, if it is actually
the output of pack "V", or maybe it came after "HTTP/1.1 GET ". The
only reasonable thing to assume is that it is just a sequence of octets,
aka binary data.
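
The equivalence of the four forms can be checked mechanically. This is a minimal sketch (not from the original message) that assumes an ASCII platform, in line with the EBCDIC footnote; the variable names are made up:

```perl
use strict;
use warnings;

# The four ways of building the same four-octet string.
my @variants = (
    "ABCD",
    "\x41\x42\x43\x44",
    chr(65) . chr(66) . chr(67) . chr(68),
    pack("C*", 65, 66, 67, 68),
);

# Every variant should equal the first one.
my $all_equal = (grep { $_ ne $variants[0] } @variants) ? 0 : 1;

# None of them needs the upgraded internal representation.
my $any_flagged = (grep { utf8::is_utf8($_) } @variants) ? 1 : 0;

# Hex dump of the shared content.
my $hex = unpack "H*", $variants[0];

print "all equal: $all_equal, any flagged: $any_flagged, hex: $hex\n";
```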

Now consider the case of

    $y = chr(1000);

Clearly whatever is in $y cannot be a single octet. The way Perl
currently works (and this is my limited understanding here - someone
with more knowledge can feel free to step in and correct my errors)
is that now $y is considered to be a string of Unicode codepoints. So
$y contains a single codepoint, U+03E8. The internal flag is used to
indicate that the internal data pointer points to something that is a
"Unicode codepoint string".

What can we do with such a string? We can try to print it, but if we
have not converted it we get a message like

    Wide character in print at - line 1.

and we get the bytes "cf a8" as output because that is the internal
encoding.

    print unpack("H*", $y);

produces "cfa8" as output, again because we have been given access to
the string as it exists upgraded.
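
A minimal sketch (not from the original message) that obtains the same two octets without relying on how unpack treats an upgraded string, which as far as I know has changed between Perl versions. It uses the built-in utf8::encode on a copy to make the UTF-8 encoding explicit:

```perl
use strict;
use warnings;

my $y = chr(1000);

# At the Perl level this is one character with code point 0x3E8.
my $codepoint = sprintf "U+%04X", ord $y;
my $length    = length $y;

# Encode a copy to UTF-8 explicitly; the copy is now a byte string.
my $octets = $y;
utf8::encode($octets);
my $hex         = unpack "H*", $octets;
my $octet_count = length $octets;

print "$codepoint, $length character, $octet_count octets: $hex\n";
```

Encoding explicitly keeps the distinction between the character string and its byte representation visible in the code.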

On the other hand,

    print unpack("H*", pack("C", 1000));

produces "e8".

So consider again:

    unpack("C*", $y);

This currently produces the list (207, 168) which is again the internal
encoding. What else should it do? If you expect values over 255, then
you should not use "C". If you don't have values over 255, then why is
your string not just a sequence of bytes? Something must have occurred
to upgrade it to "sequence of unicode codepoints".

Of course if you have values over 255 you have to use "U" in unpack,
that only makes sense! On the other hand, if you are agnostic to your
string and just treat it as "data" then it will never get upgraded. So
where is the issue?
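
A short sketch of the "U" approach (an illustration, not part of the original message), with a split/ord version alongside for comparison. The sample string is made up:

```perl
use strict;
use warnings;

# One code point above 255 followed by two ASCII characters.
my $text = chr(1000) . "AB";

# "U*" yields one number per code point.
my @via_unpack = unpack "U*", $text;

# The same list, obtained without pack templates.
my @via_ord = map { ord } split //, $text;

print "unpack U*: @via_unpack\n";
print "split/ord: @via_ord\n";
```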

It sounds to me that what you are trying to suggest is something along
the lines of another type of Sv for the case of "unicode codepoint
sequence", so that SvPV implicitly means "This scalar is not upgraded
and is just data" and SvP_UnicodeArrayValue_ would contain the upgraded
value. Then for anything that wanted a SvPV (XS code, unpack "C") the
only sensible thing would be to try to downgrade the string at that
point and then emit a warning in the case of "wide characters" being
present.

This is the point at which someone more familiar with internals chimes
in and says "This has problems [backwards compatibility, tuits, other]."

And of course this would preclude being able to inspect Perl's internal
Unicode representation using unpack "C".

--
-Ben Carter

Human beings, who are almost unique in having the ability to learn from
the experience of others, are also remarkable for their apparent
disinclination to do so. - Douglas Adams, "Last Chance to See"

[1] I am deliberately ignoring the box in the corner labeled "EBCDIC".