Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 20:39:06
Moin,
On Friday 30 March 2007 22:38:19 Juerd Waalboer wrote:
> Tels wrote 2007-03-31 0:19 (+0000):
> > Anyway, I wasn't aware that any non-utf8 data in Perl is *always*
> > ISO-8859-1, I thought that, when not specified, this depended on some
> > other stuff. Guess I need to reread the tutorials.
>
> Note that they are unicode strings, and that Perl is theoretically free
> to change the internal representation at any time.
>
> > However, this also poses the question: How does Perl know that your
> > data is in KOI8-R?
>
> Because you tell it that it is with "decode". The resulting string is a
> unicode string, which may have any encoding internally. (Practically,
> this is limited to latin1 and utf8.)
>
>   my $text_string = decode("koi8-r", $byte_string);
>
> or, if you prefer different terminology:
>
>   my $unicode_string = decode("koi8-r", $koi8r_string);
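
As a minimal sketch of the decode call quoted above: the snippet assumes the core Encode module and a made-up three-byte KOI8-R sample (the word "мир"). The variable names follow the quoted text; $utf8_bytes is added only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample data: the KOI8-R bytes for the Russian word "мир".
my $koi8r_string = "\xCD\xC9\xD2";

# Tell Perl how to interpret the bytes; the result is a text (unicode) string.
my $text_string = decode("koi8-r", $koi8r_string);

# Three characters, the first being U+043C CYRILLIC SMALL LETTER EM.
printf "chars: %d, first: U+%04X\n", length($text_string), ord($text_string);

# Encoding explicitly is the way back to bytes, here as UTF-8
# (two bytes per Cyrillic letter).
my $utf8_bytes = encode("UTF-8", $text_string);
printf "UTF-8 bytes: %d\n", length($utf8_bytes);
```

Note that the script never needs to know how $text_string is stored internally; it only names an encoding at the points where bytes come in or go out.
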
I thought you would say this.

My question was posed because I wanted to know how to *keep* a KOI8 (or any
other random binary) string in Perl without converting it to Unicode. It
seems to me this is not easily possible because there are literally dozens of
places where your KOI8 string might get suddenly upgraded to UTF-8 (and
thus get corrupted because Perl treats it as ISO-8859-1). Or did I get this
wrong?

In an ideal world, you could either just keep everything in utf-8 (that's
too slow for some things and not fool-proof either), or rely on no other
code to corrupt your data - especially this random third party module you
pulled from CPAN last night.

IMHO the problem arises from the fact that Perl makes no distinction between
a byte string like "a" and a text string like "a", and furthermore,
manipulating a byte string (for instance appending a byte) is done with
typical string operators. So:
$byte_string = 'something random bytes';
# works if $y is 7bit and no utf8 flag
# but fails if $y is 7bit with utf8 flag
$byte_string .= $y;
As you said, all is well as long as you can keep these two beasts separate,
but the slightest problem might mangle your data. Such as a decode_utf8
setting the UTF8 bit on a 7bit ASCII string, thereby changing the 7bit
byte string to a text string.
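
To see what the concatenation above does in practice, here is a small sketch using only the core utf8:: functions. The sample bytes, and the use of utf8::upgrade to produce a 7-bit string with the flag set, are assumptions made for illustration; the flag state printed after the append may differ between perl versions.

```perl
use strict;
use warnings;

my $byte_string = "\xC1\xC2\xC3";   # assumed sample: three high bytes, flag off
my $original    = $byte_string;

my $y = "abc";
utf8::upgrade($y);                  # 7-bit content, utf8 flag now on

$byte_string .= $y;

# The internal representation may have changed...
printf "flag after append: %s\n", utf8::is_utf8($byte_string) ? "on" : "off";

# ...but the logical content is still the original bytes plus "abc".
my $content_ok = ($byte_string eq $original . "abc") ? 1 : 0;
my $length     = length($byte_string);

# Downgrading restores a one-byte-per-character representation.
my $downgraded = utf8::downgrade($byte_string) ? 1 : 0;
print "content ok: $content_ok, length: $length, downgraded: $downgraded\n";
```

In this sketch the string still compares equal to the expected bytes, so any mangling would have to come from code that inspects the internal representation rather than from the append itself.
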
Hm, maybe one could write a module that always tacks the encoding onto an SV
via magic. And then you could have a special encoding called "BINARY" (or
absence of an encoding means it is treated as binary), so that if you ever
try to fuse two strings together where one of them is tagged binary, you
get an exception (but only then!).

As you said, the current warnings::encode can't decide between the case
of "BINARY + UTF_8" and "ISO-8859-1 + UTF_8" as Perl makes no distinction
between binary data and ISO-8859-1. And this missing distinction is
certainly a bother.

> > One of the limitations of the "there can be only two encodings" of Perl
> > seems to be that strings are permanently upgraded:
> >   $iso_8859_1 = '...';
> >   $utf8 = '...';
> >   if ($iso_8859_1 eq $utf8) { ... }
>
> $iso_8859_1 is temporarily upgraded to utf8 for this comparison.
> (Yes, this copies data, and then throws it away. Again, optimization
> does require knowing internals. The easiest optimization here is to
> utf8::upgrade $iso_8859_1, after which the variable name no longer makes
> sense.)

Nah, in this case I wanted the temporary upgrade.
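
A small sketch of the comparison discussed above, assuming a sample string "caf\xE9" and using utf8::upgrade to force the second copy into the internal UTF-8 form. The helper variables $equal and $still_bytes are illustrative only.

```perl
use strict;
use warnings;

my $iso_8859_1 = "caf\xE9";   # stored internally as one byte per character
my $utf8       = "caf\xE9";
utf8::upgrade($utf8);         # same characters, stored internally as UTF-8

# eq compares characters, not internal bytes, so the two are equal.
my $equal = ($iso_8859_1 eq $utf8) ? 1 : 0;

# The comparison does not permanently upgrade the left-hand variable.
my $still_bytes = utf8::is_utf8($iso_8859_1) ? 0 : 1;

print "equal: $equal, left operand still byte-encoded: $still_bytes\n";
```
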
> > Just like 1 + 2.0 will result in 3.0 and not 3 and we all know how
> > much confusion this creates (heh, I fell for it today, even tho I
> > should have known better).
>
> Doesn't really cause me any headaches, to be honest.

Yeah, I am not a genius :/ (Sometimes I wish I could upgrade my brain.)

> > > The same type of string can be used for binary data, because in the
> > > unicode encoding "latin1", all 256 codepoints map to the same byte
> > > values.
> >
> > This sounds like a circular definition, because in CP1250, also all 256
> > codepoints map to the same byte values. Except they are different byte
> > values.
>
> I said "unicode encoding", but should have said "unicode codepoints".
>
> Codepoints 0..256 in latin1 map to byte values 0..256. That makes it
> special.

Erm, I don't buy this because:

Codepoints 0..256 in KOI8-R (to pick one) map to byte values 0..256. That
would make it special, too.

(I don't necessarily disagree with you, I just don't understand what you mean.)

> > > > In short, it becomes a mess.
> > >
> > > Yes, with strong typing, especially with string subtypes for
> > > arbitrary encodings, it would be cleaner. But it would also not look
> > > like Perl 5.
> >
> > Over the years, I have come to the insight that I want to build reliable
> > and fast programs. (easy to maintain, reliable, fast, pick two)
>
> I do that with Perl. Really, you should check that language out! You'll
> LOVE it!

Yeah, maybe one day I will actually start real programming work in Perl. ;)

All the best,
Tels

PS: I think this discussion has become a bit off-topic, so we should
probably keep it off-list. Just for the original topic and the record, when
you have pure 7bit ASCII data, Perl (decode etc) should not set the utf8
flag on the data, as that makes things go slower and is just a waste. In
fact, it shouldn't even copy the data around etc., it should only make
exactly one run through the data to count the high-bit bytes.
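
For the record, the behaviour in question can be inspected with a short sketch like the one below. It assumes the core Encode module and a made-up ASCII sample; whether the flag ends up on or off depends on the Encode version, so the script only prints it and makes no claim about the result.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $ascii_bytes = "plain ascii";          # assumed sample: 7-bit data only
my $decoded     = decode_utf8($ascii_bytes);

# Report the flag; the outcome is version-dependent, so it is not assumed here.
printf "utf8 flag after decode_utf8: %s\n",
    utf8::is_utf8($decoded) ? "on" : "off";

# Either way, the decoded text compares equal to the original bytes.
my $same_text = ($decoded eq $ascii_bytes) ? 1 : 0;

# A copy can be forced back to the byte representation if the flag matters.
my $downgraded = $decoded;
utf8::downgrade($downgraded);
print "same text: $same_text\n";
```
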

PPS: Thanx for the discussion, this really helps me to understand things
better.

P³S: Unrelated to this thread, I was working on benchmarking Encode and the
ISO-8859-1 to UTF-8 upgrade code. Stay tuned.

--
Signed on Sat Mar 31 01:18:34 2007 with key 0x93B84C15.
View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.

". . . my work, which I've done for a long time, was not pursued in
order to gain the praise I now enjoy, but chiefly from a craving after
knowledge, which I notice resides in me more than in most other men. And
therewithal, whenever I found out anything remarkable, I have thought it
my duty to put down my discovery on paper, so that all ingenious people
might be informed thereof."

 -- Antony van Leeuwenhoek. Letter of June 12, 1716

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 19:20:52
On Sat, Mar 31, 2007 at 01:39:06AM +0000, Tels <nospam-abuse bloodgate.com> wrote:
> My question was posed because I wanted to know how to *keep* a KOI8 (or any
> other random binary) string in Perl without converting it to Unicode. It
> seems to me this is not easily possible because there are literally dozens of
> places where your KOI8 string might get suddenly upgraded to UTF-8 (and
> thus get corrupted because Perl treats it as ISO-8859-1). Or did I get this
> wrong?

Yes, you did get that wrong, likely because Juerd wants users to care about
that. But in fact, if you try it, nothing will get corrupted unless you use
unpack "C" to get the first byte of your KOI8 string. Then you might get
a surprise (current perl) or an exception (Juerd's idea).

> In an ideal world, you could either just keep everything in utf-8 (that's
> too slow for some things and not fool-proof either), or rely on no other
> code to corrupt your data - especially this random third party module you
> pulled from CPAN last night.

In an ideal world, you would just want to manipulate bytes == characters in
Perl, and not care about how it treats them internally. It should treat them
as fast as possible, of course.

The same is true for other things in perl: you do not want to care whether
your scalar contains an integer, a floating-point number, or a string. Use
decides that in perl: if you print an integer scalar, it (also) turns into a
string. If you add a floating-point number to an integer-only scalar, you get
the expected floating-point result.

Perl converts between all those "encodings" transparently in the way that
makes the most sense. And the same thing is true for character data.

There is a small difference, as Perl can have scalars that have both a string
and a double value, for example, and can then choose the fastest
representation. Perl could just as well keep both a UTF-X encoded and
an octet-encoded version of a string around to optimise for speed.

Of course, that optimisation would need a lot of memory, so the trade-off
chosen in the current implementation is to upgrade/downgrade when needed,
transparently, so your KOI8 bytes stay KOI8 bytes all the time.

It is the few cases where perl doesn't do that that I am concerned about.
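
The transparent upgrade/downgrade described above can be sketched as follows. The sample KOI8-R bytes and the in-memory filehandle are assumptions chosen to keep the example self-contained; a real program would more likely write to a file or socket.

```perl
use strict;
use warnings;

my $koi8r_bytes = "\xCD\xC9\xD2";   # assumed sample: three KOI8-R bytes

# Force a copy into the internal UTF-8 representation.
my $upgraded = $koi8r_bytes;
utf8::upgrade($upgraded);

# Write the upgraded copy to a binary (layer-less) in-memory filehandle.
my $buffer = '';
open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
binmode($fh);
print {$fh} $upgraded;
close($fh);

# The bytes that were written are the original three KOI8-R bytes.
printf "written: %s\n", join(" ", map { sprintf "%02X", ord } split //, $buffer);
```

Since every character in the string is below 256, perl converts it back to single bytes on output, which is the behaviour the paragraph above relies on.
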

> OMHO the problem arises from the fact that Perl makes no distinction between
> a byte string like "a" and a text string like "a", and furthermore,
> manipulating byte string (for instance appending a byte) is done with
> typical string operators. So:

Yeah. It also makes no distinction between numbers and strings. That's Perl.

> # works if $y is 7bit and no utf8 flag
> # but fails if $y is 7bit with utf8 flag
> $byte_string .= $y;
>
> As you said, all is well as long as you can keep these two beasts separate,
> but the slightest problem might mangle your data. Such as a decode_utf8
> setting the UTF8 bit on a 7bit ASCII string, therefore changing the 7bit
> byte string to a text string.

No, only in Juerd's model where binary data encoded in UTF-X is a bug. In
real-world perl, that just works fine, and that's what I expect, and that's,
I think, what users expect, too: not having to deal with the internal types.

In the same way, you do not have a module that converts numbers to strings,
you just print them:

   my $x = 5;
   print $x;

Again, perl transparently handles the details (which includes(!) character
encoding for the outside world!).

> As you said, the current warnings::encode can't decide between the case
> of "BINARY + UTF_8" and "ISO-8859-1 + UTF_8" as Perl makes no distinction
> between binary data and ISO-8859-1. And this missing distinction is
> certainly a bother

Only when you hit bugs, or unpack.

Greetings,
--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ pcg goof.com
--==---/ / _ / // / / / http://schmorp.de/
-=====/_/_//_/_,_/ /_/_ XX11-RIPE

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 19:33:55
Tels wrote 2007-03-31 1:39 (+0000):
> My question was posed because I wanted to know how to *keep* a KOI8 (or any
> other random binary) string in Perl without converting it to Unicode. It
> seems to me this is not easily possible because there are literally dozens of
> places where your KOI8 string might get suddenly upgraded to UTF-8 (and
> thus get corrupted because Perl treats it as ISO-8859-1). Or did I get this
> wrong?

A koi8r string is a byte string. If you keep it separated from text
strings properly, it should not be upgraded, and thus will not be treated
as latin1.

I'm very curious as to "sudden upgrades" that aren't related to mixing
with text strings. Should you encounter them, please let me know.

Indeed, some functions and operations will not work properly on koi8r,
with regards to character properties. For example, the regex engine has
no idea which characters are word characters, and which are cyrillic. It
can only assume it's either ascii or latin1. For full functionality, you
must decode the string.
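
The regex limitation just described can be illustrated with a brief sketch. It assumes the core Encode module and a single sample byte, 0xC1, which is the Cyrillic small letter "а" in KOI8-R; the helper variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $koi8r_string = "\xC1";                          # "а" in KOI8-R
my $text_string  = decode("koi8-r", $koi8r_string); # U+0430

# On the raw byte string, the regex engine sees codepoint 0xC1, which is
# not a Cyrillic character in Unicode.
my $bytes_look_cyrillic = ($koi8r_string =~ /\p{Cyrillic}/) ? 1 : 0;

# After decoding, the character properties are what one would expect.
my $text_looks_cyrillic = ($text_string =~ /\p{Cyrillic}/) ? 1 : 0;
my $text_is_word_char   = ($text_string =~ /^\w$/)         ? 1 : 0;

print "bytes: $bytes_look_cyrillic, text: $text_looks_cyrillic, ",
      "word char: $text_is_word_char\n";
```
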

If your program is just a gateway in between other things, and doesn't
do any text processing, just keep the thing a byte string.

Just like $jpeg_image is a byte string that contains JPEG data, and this
can be safely used, $koi8r_string can be a byte string that contains
koi8r text data.

> especially this random third party module you pulled from CPAN last
> night.

Well, yes, modules sometimes have bugs. That's something we have to
learn to live with.

> As you said, all is well as long as you can keep these two beasts separate,
> but the slightest problem might mangle your data.

That is true. Programming can be a delicate job. It has always been like
that.

> Hm, maybe one could write a module that always tacks the encoding onto an SV
> via magic. (...) so that if you ever try to fuse two strings together
> where one of them is tagged binary, you get an exception (but only
> then!).

That would be neat. You'd effectively have strong typing. I don't think
you can do this in a module, though. It requires checks all over the
place. Maybe Scott Walters' typesafety module can be of help or
inspiration: http://search.cpan.org/~swalters/typesafety-0.05/

> Yeah, I am not a genius :/ (Sometimes I wish I could upgrade my brain

But then, it would be much slower! ;)

> > Codepoints 0..256 in latin1 map to byte values 0..256. That makes it
> > special.
> Erm, I don't buy this because:
> Codepoints 0..256 in KOI8-R (to pick one) map to byte values 0..256. That
> would make it special, too.

I should have said "unicode codepoints 0..255 in latin1 map ...".

The interesting thing about latin1 is that 0..255 overlap with unicode.
The 0..255 (not 256 btw, silly mistake) in koi8-r can all be found in
unicode somewhere, but they're not all in exactly the same places.
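
The difference between the two encodings can be checked with a sketch like this one, which assumes the core Encode module and counts, for each encoding, how many byte values decode to a unicode codepoint with a different number. The %mismatches hash is an illustrative name.

```perl
use strict;
use warnings;
use Encode qw(decode);

my %mismatches;
for my $encoding ("iso-8859-1", "koi8-r") {
    $mismatches{$encoding} = 0;
    for my $byte (0 .. 255) {
        my $octet = chr($byte);                 # one-byte string
        my $char  = decode($encoding, $octet);  # one-character text string
        # Count bytes whose unicode codepoint differs from the byte value.
        $mismatches{$encoding}++ if ord($char) != $byte;
    }
}

printf "%-10s bytes that move: %d\n", $_, $mismatches{$_}
    for sort keys %mismatches;
```

For latin1 no byte moves, which is what makes it usable as the default interpretation of a byte string; for koi8-r the bytes above the ASCII range land elsewhere in unicode.
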
--
Kind regards,
  juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
  convolution: ict solutions and consultancy <sales convolution.nl>
I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.