Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 19:19:16

Moin,

On Friday 30 March 2007 21:44:12 Juerd Waalboer wrote:
> Tels wrote 2007-03-30 23:17 (+0000):
> > > If it is so deadly to collide byte-oriented data with character data,
> > > it should not be so easy to do so accidentally.
> >
> > It can happen every time you concatenate two strings. Maybe we could add
> > a new warning?
>
> Eh, no, because Perl does not have any metadata telling you if this
> non-UTF8 string is a latin1 text string, or just a random byte string.
>
> There is no way to tell Perl how you intended your string to be used,
> and there is no way for Perl to tell you the same thing about a string
> it returned.
>
> > 	use warnings 'upgrade';
>
> This already exists on CPAN, authored by Audrey Tang, as
> encoding::warnings:
>
>     use encoding::warnings;
>
> But it will warn when Perl upgrades latin1 to utf-8, without knowing if
> that is a bug or a feature, because it doesn't know if the "latin1"
> string was meant as a text string or a byte string.
>
> It's a useful debugging tool, to find unintended upgrades, but you
> shouldn't try to avoid upgrading altogether. That just hurts, because
> upgrading is part of the way the Perl Unicode model was intended.
>
> > 	* the length in bytes
> > 	* the length in characters (not always set, e.g. can be unknown)
> > 	* the storage buffer (containing the data, plus some optional padding)
> > 	* the encoding
>
> Hey, cool, Perl has almost the same thing, only it supports just two
> encodings: latin1 and utf8. It uses a single bit to indicate the
> encoding, the UTF8 flag, which can be on or off. When it's off, the
> string is latin1, when it's on, the string is UTF-8.
>
> Maybe you should try Perl; you'll like the way it's built, because it
> very closely matches your own design!

First for the record:

The application I am outfitting is written in C, for speed, and quite large.
So there is NO way I would even consider rewriting it in Perl. I'm just
using the right tool for the right job. That doesn't mean I do not like
Perl, or the way Perl does things. Sorry if this sounded like it.

Anyway, I wasn't aware that any non-utf8 data in Perl is *always*
ISO-8859-1, I thought that, when not specified, this depended on some other
stuff. Guess I need to reread the tutorials.

However, this also poses the question: How does Perl know that your data is
in KOI8-R?

(Yes, that's a trick question, but I would like to hear your answer to that,
in any case, just to make it clear to me. No offence meant!)

One of the limitations of the "there can be only two encodings" of Perl
seems to be that strings are permanently upgraded:

	$iso_8859_1 = '...';
	$utf8 = '...';

	if ($iso_8859_1 eq $utf8) { ... }

Please correct me if I am wrong, but I do think it is not possible to
keep both variables in their current encoding and only temporarily upgrade
them to utf8 (for the common encoding that contains both of them)?

After reading this discussion here, a lot of problems also seem to stem from
the fact that the upgrade to utf8 is permanent, silent, and done
"behind-the-scenes". Just like 1 + 2.0 will result in 3.0 and not 3, and we
all know how much confusion this creates (heh, I fell for it today, even
tho I should have known better).

> The same type of string can be used for binary data, because in the
> unicode encoding "latin1", all 256 codepoints map to the same byte
> values.

This sounds like a circular definition, because in CP1250, also all 256
codepoints map to the same byte values. Except they are different byte
values.

In my application, I also considered having a "BINARY" encoding, but in the
end I opted to make ISO-8859-1 the default encoding for BINARY stuff. (Ha,
great minds sink alike or so) And since unlike in Perl, upgradings are
never done permanently, you can keep your BINARY string and compare it to
UTF-8 whatever, and it never gets "corrupted".

I am not sure how one could achieve that in Perl. Making the SV read-only?

> > In short, it becomes a mess.
>
> Yes, with strong typing, especially with string subtypes for arbitrary
> encodings, it would be cleaner. But it would also not look like Perl 5.

Over the years, I have come to the insight that I want to build reliable and
fast programs. (Easy to maintain, reliable, fast: pick two.)

So maybe we really need "use strict 'encodings';"


All the best,

Tels

-- 
 Signed on Sat Mar 31 00:04:29 2007 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Blogebrity: Wow, guess what this one stands for? Too easy. Hey, anyone
 can do it: take a blogger who's a chef, and you get: BLEF. A blogger
 who's a dentist? BENTIST. A female blogger with an itch? You guessed it:
 a BITCH."

  -- maddox from xmission

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 17:38:19
Tels wrote 2007-03-31  0:19 (+0000):
> Anyway, I wasn't aware that any non-utf8 data in Perl is *always*
> ISO-8859-1, I thought that, when not specified, this depended on some other
> stuff. Guess I need to reread the tutorials.

Note that they are unicode strings, and that Perl is theoretically free
to change the internal representation at any time.

> However, this also poses the question: How does Perl know that your data is
> in KOI8-R?

Because you tell it that it is with "decode". The resulting string is a
unicode string, which may have any encoding internally. (Practically,
this is limited to latin1 and utf8.)

    my $text_string = decode("koi8-r", $byte_string);

or, if you prefer different terminology:

    my $unicode_string = decode("koi8-r", $koi8r_string);
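As a rough illustration of the decode step (a sketch only: the sample bytes, which are assumed to spell the Russian word "Privet" in KOI8-R, are not taken from this thread), a round trip through the core Encode module might look like this:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Six bytes that spell "Privet" (in Cyrillic) when read as KOI8-R.
# This sample data is an assumption made for the illustration.
my $byte_string = "\xF0\xD2\xC9\xD7\xC5\xD4";

# decode() interprets the bytes as KOI8-R and returns a character string.
my $text_string = decode("koi8-r", $byte_string);

# Six characters; the first is U+041F (CYRILLIC CAPITAL LETTER PE).
printf "%d characters, first is U+%04X\n",
    length($text_string), ord($text_string);

# Encoding back to KOI8-R reproduces the original bytes.
my $round_trip = encode("koi8-r", $text_string);
print "round trip ok\n" if $round_trip eq $byte_string;
```

Perl only learns that the bytes were KOI8-R because the first argument to decode says so; nothing in $byte_string itself records it.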

> One of the limitations of the "there can be only two encodings" of Perl
> seems to be that strings are permanently upgraded:
> 	$iso_8859_1 = '...';
> 	$utf8 = '...';
> 	if ($iso_8859_1 eq $utf8) { ... }

$iso_8859_1 is temporarily upgraded to utf8 for this comparison.

(Yes, this copies data, and then throws it away. Again, optimization
does require knowing internals. The easiest optimization here is to
utf8::upgrade $iso_8859_1, after which the variable name no longer makes
sense.)
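A small sketch of that behaviour, assuming a reasonably recent perl (the variable names and sample strings are made up for the illustration): the UTF8 flag of the left operand can be inspected before and after the comparison, and again after an explicit utf8::upgrade.

```perl
use strict;
use warnings;

my $latin1 = "caf\xE9";            # 4 characters, stored without the UTF8 flag
my $wide   = "caf\xE9 \x{263A}";   # contains U+263A, so stored with the flag

my $before = utf8::is_utf8($latin1) ? 1 : 0;
my $equal  = ($latin1 eq $wide) ? 1 : 0;   # the operand itself is left alone
my $after  = utf8::is_utf8($latin1) ? 1 : 0;

print "flag before eq: $before, after eq: $after, equal: $equal\n";

# An explicit upgrade changes how the variable is stored, not its value.
my $copy = $latin1;
utf8::upgrade($latin1);
my $flag_after_upgrade = utf8::is_utf8($latin1) ? 1 : 0;
print "flag after utf8::upgrade: $flag_after_upgrade\n";
print "value unchanged\n" if $latin1 eq $copy;
```

Whether the comparison makes a temporary upgraded copy or compares the two forms directly is an implementation detail; either way the flag on $latin1 should be the same before and after the eq.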

> Just like 1 + 2.0 will result in 3.0 and not 3 and we all know how
> much confusion this creates (heh, I fell for it today, even tho I
> should have known better).

Doesn't really cause me any headaches, to be honest.

> > The same type of string can be used for binary data, because in the
> > unicode encoding "latin1", all 256 codepoints map to the same byte
> > values.
> This sounds like a circular definition, because in CP1250, also all 256
> codepoints map to the same byte values. Except they are different byte
> values.

I said "unicode encoding", but should have said "unicode codepoints".

Codepoints 0..255 in latin1 map to byte values 0..255. That makes it
special.
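One way to check this claim is to decode every possible byte value and compare the resulting codepoint with the byte. The following is only a sketch using the core Encode module; the counter names are invented for the example, and CP1250 is included purely as the contrast raised earlier in the thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

my ($latin1_same, $cp1250_same) = (0, 0);
for my $byte (0 .. 255) {
    my $octet = chr($byte);
    # Count the bytes whose decoded codepoint equals the byte value itself.
    $latin1_same++ if ord(decode("iso-8859-1", $octet)) == $byte;
    $cp1250_same++ if ord(decode("cp1250",     $octet)) == $byte;
}

print "latin1: $latin1_same of 256 bytes keep their value as a codepoint\n";
print "cp1250: $cp1250_same of 256 bytes keep their value as a codepoint\n";

# Example of a byte that moves: 0x8A in CP1250 is U+0160.
my $moved = ord(decode("cp1250", "\x8A"));
printf "byte 0x8A in cp1250 is U+%04X\n", $moved;
```

For latin1 the count should be 256; for CP1250 it is lower, because bytes such as 0x8A decode to codepoints above 255.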

> > > In short, it becomes a mess.
> > Yes, with strong typing, especially with string subtypes for arbitrary
> > encodings, it would be cleaner. But it would also not look like Perl 5.
> Over the years, I have come to the insight that I want to build reliable
> and fast programs. (Easy to maintain, reliable, fast: pick two.)


I do that with Perl. Really, you should check that language out! You'll
LOVE it!
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 19:11:22
On Sat, Mar 31, 2007 at 12:19:16AM +0000, Tels <nospam-abusebloodgate.com> wrote:
> Anyway, I wasn't aware that any non-utf8 data in Perl is *always*
> ISO-8859-1, I thought that, when not specified, this depended on some other
> stuff. Guess I need to reread the tutorials.

Heh, because it's not true.

> However, this also poses the question: How does Perl know that your data is
> in KOI8-R?

It doesn't. Perl ideally only interprets character indices as unicode
codepoints (I am ignoring use locale and similar issues here). So when you
want to match your koi8-r data against a regex, you need to decode it first.
Perl doesn't know that and will only *then* treat your character data as
KOI8-R (and afterwards as unicode).

Unless you force perl to apply unicode interpretations to your characters,
they are completely encoding-free.

> One of the limitations of the "there can be only two encodings" of Perl
> seems to be that strings are permanently upgraded:

That's the root of the problem. There aren't two encodings. There is only one:
characters concatenated to form strings.

Internally, Perl currently has two forms for that, just as perl can store
real integers and doubles in a scalar.

But on the Perl level, "5", "5.0", 5 and utf8-encoded 5 are all the same
scalar.

> 	if ($iso_8859_1 eq $utf8) { ... }
> 
> Please correct me if I am wrong, but I do think it is not possible to
> keep both variables in their current encoding and only temporarily upgrade
> them to utf8 (for the common encoding that contains both of them)?

It is, but likely not very efficient, as in most such cases you actually
want utf-x internally. Except for optimisation purposes (where I see
downgrade and upgrade as well-warranted), you do not have to care, as perl
handles that automatically.

> After reading this discussion here, a lot of problems also seem to stem from
> the fact that the upgrade to utf8 is permanent, silent, and done
> "behind-the-scenes". Just like 1 + 2.0 will result in 3.0 and not 3, and we
> all know how much confusion this creates (heh, I fell for it today, even
> tho I should have known better).

No, there is no problem in most cases, as the upgrade does not change the
scalar in any way (except, again, for speed). Or at least should.

Perl achieves that goal by transparently re-encoding its internal format
as required. Re-coding in that way does not change the semantics of the
string, except:

- when you hit a bug in perl
- when you use unpack "C".

So in a bug-free perl without unpack, everything just works and you never
need to care about whether perl stores the data as UCS-4, UTF-X or octets
in memory.
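The claim that the internal form does not affect string semantics can be spot-checked with a few ordinary operations. This is only a sketch: the sample string and the particular checks are chosen for illustration, and it deliberately stays away from unpack and from anything that asks about character properties.

```perl
use strict;
use warnings;

my $octets   = "na\xEFve";   # stored as one byte per character
my $upgraded = $octets;
utf8::upgrade($upgraded);    # same characters, UTF-8 storage internally

# Each entry is true when both representations behave identically.
my @checks = (
    $octets eq $upgraded,
    length($octets) == length($upgraded),
    ord(substr($octets, 2, 1)) == ord(substr($upgraded, 2, 1)),
    index($octets, "\xEF") == index($upgraded, "\xEF"),
    join(",", map { ord($_) } split //, $octets)
        eq join(",", map { ord($_) } split //, $upgraded),
);

my $failed   = grep { !$_ } @checks;
my $all_same = $failed ? 0 : 1;
print "all checks agree: $all_same\n";
```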

That's the "sane" model introduced with 5.6 and mostly achieved with 5.8.8.

The problems are the remaining bugs AND unpack, the latter of which breaks
existing programs that assume unpack "C" has byte semantics, when, in
fact, it returns the internal encoding that perl normally hides from you
and tells you to ignore.
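How the internal encoding can leak is easiest to show with bytes::length rather than unpack "C", whose behaviour on upgraded strings was, as far as I know, changed in later perl releases. So the following sketch is a stand-in for the unpack case, not a reproduction of it; the sample string is made up.

```perl
use strict;
use warnings;
use bytes ();   # makes bytes::length available without enabling byte semantics

my $octets   = "caf\xE9";
my $upgraded = $octets;
utf8::upgrade($upgraded);

# Character-level view: both strings have the same length.
my @chars = (length($octets), length($upgraded));
print "length: $chars[0] vs $chars[1]\n";

# Storage-level view: the upgraded copy needs two bytes for U+00E9.
my @stored = (bytes::length($octets), bytes::length($upgraded));
print "bytes:  $stored[0] vs $stored[1]\n";
```

Any code that looks at the storage-level view gets different answers for two strings that compare equal, which is the kind of breakage described above.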

If those remaining problems were fixed (that includes SvPV), the only
difference between utf-x encoding and octet-encoding within perl would be
speed, but not semantics.

That's the beauty.

Juerd's goal of having the UTF-X flag exposed and making you think about
when perl upgrades and downgrades (and making you avoid the upgrades) is
horrible, as it forces a lot of administration on the programmer, a lot of
which perl already claims to do, as only in a few cases do you have to know
your UTF-X flag at the moment.

> > The same type of string can be used for binary data, because in the
> > unicode encoding "latin1", all 256 codepoints map to the same byte
> > values.

latin1 is not a unicode encoding in the first place.

Also, I find it much more natural to represent bytes as characters 0..255 in
perl, as opposed to Juerd's definition of characters 0..255 with the internal
UTF-X flag cleared.

I just don't see why the programmer has to learn about that internal flag
at all. If they have to, then perl could become much, much faster by forcing
them to do that all the time, instead of only in unpack or XS cases.

> great minds sink alike or so) And since unlike in Perl, upgradings are
> never done permanently, you can keep your BINARY string and compare it to
> UTF-8 whatever, and it never gets "corrupted".

In the 5.5 model, nothing ever gets "corrupted", either. That's the beauty
of it. Because scalars with the UTF-X flag set behave the same way as scalars
not having it set, everything is compatible with each other.

It's only in the cases _where_ it makes a difference that this is a problem
and stuff in fact gets corrupted.

> I am not sure how one could achieve that in Perl. Making the SV read-only?

By fixing the remaining bugs and making the UTF-X flag truly internal, so
you do not have to worry about modules corrupting your stuff.

That's what perl does for you in the vast majority of cases already, and it
should simply do that all the time, so programmers have their typeless perl
that they love again.

> > > In short, it becomes a mess.
> >
> > Yes, with strong typing, especially with string subtypes for arbitrary
> > encodings, it would be cleaner. But it would also not look like Perl 5.

I beg to differ. Strong typing makes programming hard. Until Perl6 came and
destroyed it, the typeless nature of Perl was a feature, not a problem.

Why should perl suddenly introduce types for strings when a single abstract
string type works just as wonderfully as the single abstract scalar type works
in perl already?

Having strongly typed integers/doubles/utf-8-strings etc. is a step
backwards from perl towards Java.

Programmers using Perl do not want to worry about strict typing. They can
use C++ or Java anytime for that.

> Over the years, I have come to the insight that I want to build reliable
> and fast programs. (Easy to maintain, reliable, fast: pick two.)
>
> So maybe we really need "use strict 'encodings';"

What for, so that your program crashes at runtime instead of degrading to a
slower but correct case in case it happens to hit binary data? You surely do
not want this, or do you?
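For what it is worth, a program that wants to choose between failing and degrading can already make that decision explicitly with utf8::downgrade. The helper below is hypothetical (fits_in_octets is a name invented for this sketch, not an existing function), and it is only one possible way to express the check.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character of $string is in 0..255,
# i.e. the string could be stored as plain octets.
sub fits_in_octets {
    my ($string) = @_;   # work on a copy so the caller's scalar is untouched
    # Second argument is FAIL_OK: return false instead of dying.
    return utf8::downgrade($string, 1) ? 1 : 0;
}

my $binary = "\x00\xFF\x10";
utf8::upgrade($binary);          # upgraded storage, but still only 0..255
my $text = "snowman \x{2603}";   # contains a character above 255

my $binary_fits = fits_in_octets($binary);
my $text_fits   = fits_in_octets($text);
print "binary fits: $binary_fits\n";
print "text fits:   $text_fits\n";
```

Without the second argument, utf8::downgrade dies on a string like $text, which corresponds to the "crash at runtime" behaviour; with it, the caller decides what to do next.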

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcggoof.com
      --==---/ / _ / // / / /      http://schmorp.de/
      -=====/_/_//_/_,_/ /_/_      XX11-RIPE
