Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 15:09:29 |
Marc Lehmann wrote 2007-03-30 14:24 (+0200):
> In fact, I teach a lot of people about unicode in perl.

At the German Perl Workshop, I saw your unicode presentation. I don't
know if this is a good representation for your teaching of unicode, but
I noticed that you used utf8::encode and utf8::decode, not the similar
functions from Encode.pm that are more commonly used and advised. These
utf8:: in-place encode/decode functions are efficient, but using them
means that the same SV changes from byte string to text string or vice
versa, which makes the code hard to follow, and any attempt to use
hungarian notation in code examples impossible.

Whenever I teach the Perl Unicode model, I try to call my strings
$byte_string and $text_string, or similar. But
utf8::decode($byte_string) makes $byte_string a text string, and
utf8::encode($text_string) makes $text_string a byte string, so after
these statements, the names are no longer correct.
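
The naming problem described above can be sketched as follows. This is only an illustration: the variable $misnamed and the sample bytes are made up for the example, and Encode::decode is used as the copying counterpart to the in-place utf8::decode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xc3\xa9" is the two-byte UTF-8 encoding of U+00E9.
my $byte_string = "\xc3\xa9";

# Encode::decode returns a new scalar, so both names stay truthful.
my $text_string = decode('UTF-8', $byte_string);

# utf8::decode works in place: the variable now holds one character,
# although nothing in the code that follows would tell you so.
my $misnamed = "\xc3\xa9";
utf8::decode($misnamed);

printf "%d %d %d\n",
    length $byte_string, length $text_string, length $misnamed;
```

With the copying call, a reader can trust the names; with the in-place call, the same variable means two different things depending on where you are in the code.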
(And of course, I try not to teach people the Unicode model, because
that's something that's quite internal. I try to teach the difference
between text strings and byte strings, and how to use encodings (which
are byte representations of text strings). I treat UTF-8 exactly the
same way as KOI8-R. That helps a lot!)
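
Treating UTF-8 exactly like KOI8-R might look like the sketch below. The sample character and the variable names are assumptions chosen for the example; the only calls used are encode and decode from Encode.pm.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One text string: CYRILLIC SMALL LETTER A (U+0430), written as an escape
# so the example does not depend on the encoding of the source file.
my $text_string = "\x{430}";

# Two byte representations of that same text.
my $koi8_bytes = encode('KOI8-R', $text_string);
my $utf8_bytes = encode('UTF-8',  $text_string);

# Decoding either one with the matching encoding gives the text back.
my $from_koi8 = decode('KOI8-R', $koi8_bytes);
my $from_utf8 = decode('UTF-8',  $utf8_bytes);

printf "koi8-r: %d byte(s), utf-8: %d byte(s)\n",
    length $koi8_bytes, length $utf8_bytes;
```

Nothing in the code singles out UTF-8: it is one more encoding name passed to the same two functions.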
> If perl had the abstract model juerd dreams of

and uses in day-to-day coding, without encountering ANY of the problems
that you describe (only the regex engine still manages to surprise me,
but that's because I'm too stubborn to utf8::upgrade explicitly).

It kind of makes one wonder if this dream might be reality (and your
reality a dream?)
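
One regex surprise of this kind, as an assumption about what is meant here, is that a character in the 128..255 range can match \w or not depending on the internal representation of the string. The sketch below assumes a perl without the unicode_strings feature or a locale in effect; the variable names are made up for the example.

```perl
use strict;
use warnings;

# U+00E9 stored as a single byte; the string has not been upgraded.
my $string = "\xe9";
my $before = $string =~ /\w/ ? 1 : 0;

# utf8::upgrade changes only the internal representation,
# not the characters the string contains.
my $upgraded = $string;
utf8::upgrade($upgraded);
my $after = $upgraded =~ /\w/ ? 1 : 0;

# The two strings still compare equal.
my $same = $string eq $upgraded ? 1 : 0;

printf "before=%d after=%d same=%d\n", $before, $after, $same;
```

If before and after differ on your perl, that is the surprise: two strings that compare equal behave differently in a match.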
> then perl would have a very easy unicode model that boils down to
> what I talked about on the perl workshop: encode/decode when doing
> I/O, otherwise, enjoy.

And keep text strings and byte strings separate!!!!!!!!!!!!!eleven

Whenever you must mix text strings and byte strings, treat the byte
strings as I/O and encode/decode accordingly.

So, recap: encode/decode when doing I/O, keep text strings and byte
strings separate, otherwise, enjoy.
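
The recap can be sketched as two small helpers at the I/O boundary. The helpers read_text and write_text are hypothetical names, UTF-8 is an assumed choice of encoding, and an in-memory filehandle stands in for a real file or socket.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helpers: the only two places that know about the encoding.
sub read_text {
    my ($fh) = @_;
    local $/;                       # slurp the whole handle
    my $byte_string = <$fh>;
    return decode('UTF-8', $byte_string);
}

sub write_text {
    my ($fh, $text_string) = @_;
    print {$fh} encode('UTF-8', $text_string);
    return;
}

# An in-memory file stands in for the outside world.
my $storage = '';
open my $out, '>:raw', \$storage or die "open: $!";
write_text($out, "na\x{ef}ve \x{263a}");    # text goes out as bytes
close $out or die "close: $!";

open my $in, '<:raw', \$storage or die "open: $!";
my $text_string = read_text($in);           # bytes come back as text
close $in or die "close: $!";

printf "%d bytes stored, %d characters read\n",
    length $storage, length $text_string;
```

Everything between the two helpers only ever sees text strings, which is the "otherwise, enjoy" part.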
--
Kind regards,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales convolution.nl>
I don't trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
2007-03-30 17:32:49 |
Hi,

On Friday 30 March 2007 20:09:29 Juerd Waalboer wrote:
> Marc Lehmann wrote 2007-03-30 14:24 (+0200):
> > In fact, I teach a lot of people about unicode in perl.
>
> At the German Perl Workshop, I saw your unicode presentation. I don't
> know if this is a good representation for your teaching of unicode, but
> I noticed that you used utf8::encode and utf8::decode, not the similar
> functions from Encode.pm that are more commonly used and advised. These
> utf8:: in-place encode/decode functions are efficient, but using them
> means that the same SV changes from byte string to text string or vice
> versa, which makes the code hard to follow, and any attempt to use
> hungarian notation in code examples impossible.

However, if you have 200 MB of ASCII string, it is more efficient to *not*
copy the data around just to find out that, yes, all of it is 7-bit.

But otherwise, I basically agree with you.

All the best,

Tels

--
Signed on Fri Mar 30 22:31:28 2007 with key 0x93B84C15.
View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.
"Retsina?" - "Yes, Papa?" - "Checkmate." - "All right, Papa."
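
The efficiency argument in the message above (a large ASCII string that should not be copied just to learn it is 7-bit) can be illustrated with a small sketch. The buffer size is scaled down and the variable names are made up; the flags are printed rather than asserted, since they are an internal detail.

```perl
use strict;
use warnings;
use Encode ();

# A stand-in for the large buffer; a smaller size keeps the example cheap.
my $ascii = 'x' x 1_000_000;

# Copying approach: a second scalar of the same size is created.
my $copy = Encode::decode('UTF-8', $ascii);

# In-place approach: the same scalar is validated and reinterpreted,
# and the return value says whether it was valid UTF-8.
my $ok = utf8::decode($ascii);

printf "valid=%d copy_flag=%d inplace_flag=%d\n",
    ($ok ? 1 : 0),
    (utf8::is_utf8($copy)  ? 1 : 0),
    (utf8::is_utf8($ascii) ? 1 : 0);
```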
2007-03-30 17:20:34 |
On Fri, Mar 30, 2007 at 10:09:29PM +0200, Juerd Waalboer <juerd convolution.nl> wrote:
> Marc Lehmann wrote 2007-03-30 14:24 (+0200):
> > In fact, I teach a lot of people about unicode in perl.
>
> At the German Perl Workshop, I saw your unicode presentation. I don't
> know if this is a good representation for your teaching of unicode, but

It is, if a bit short (and I consider it a matter of taste).

> > If perl had the abstract model juerd dreams of
>
> and uses in day-to-day coding, without encountering ANY of the problems
> that you describe

Frankly, that is not a very good sign. It means either you are extremely
lucky or you don't use any of the many XS modules that silently break, or
even the Perl modules (such as the example from Compress::Zlib) that break
less silently, but more miraculously.

> It kind of makes one wonder if this dream might be reality (and your
> reality a dream?)

The dream isn't reality. If it were, people would not report bugs against
JSON::XS because it happens to create scalar values with the UTF-X bit set.

And they do so for some of my other modules doing that, too. And there are
two options for me: either tell them perl is broken w.r.t. e.g. "C", or
tell them their code is broken because they do not call downgrade.

Obviously, I prefer the former over the latter, but last time I was told
that the camelbook mentions unpack "C" as breaking the abstraction, so
it's correct.

Which suddenly invalidates a lot of code.
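
The situation described above, a scalar that arrives with the internal flag set and a caller expected to downgrade it, can be sketched as follows. The sample bytes and variable names are assumptions; unpack "C" itself is not exercised because its result on such a scalar has differed between perl releases.

```perl
use strict;
use warnings;

# Two characters in the 128..255 range, e.g. binary data.
my $plain = "\xe9\xff";

# The same characters as a module might return them: flag set internally.
my $flagged = $plain;
utf8::upgrade($flagged);

my $equal        = $plain eq $flagged ? 1 : 0;
my $flag_differs = (utf8::is_utf8($flagged) && !utf8::is_utf8($plain)) ? 1 : 0;

# The workaround: downgrade before handing the value to byte-oriented code.
# A true second argument makes utf8::downgrade return false instead of
# dying when a character above 255 is present.
my $downgraded_ok = utf8::downgrade($flagged, 1) ? 1 : 0;
my $flag_cleared  = utf8::is_utf8($flagged) ? 0 : 1;

printf "equal=%d flag_differs=%d downgraded=%d cleared=%d\n",
    $equal, $flag_differs, $downgraded_ok, $flag_cleared;
```

The two scalars compare equal throughout; only code that looks at the internal bytes can tell them apart.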
> > then perl would have a very easy unicode model that boils down to
> > what I talked about on the perl workshop: encode/decode when doing
> > I/O, otherwise, enjoy.
>
> And keep text strings and byte strings separate!!!!!!!!!!!!!eleven

I find "text strings" and "byte strings" not adequate either, as Perl
makes no distinction between those two concepts (being typeless), and
they do not map well to encoded/decoded text either. Perl only knows
how to concatenate characters; it does not know anything about bytes or
text, so utf8::encode does not necessarily create a byte string out of a
text string. It could just as well create a text string out of a byte
string (think JSON, which creates JSON _text_ out of e.g. byte strings by
encoding them to UTF-8).

> So, recap: encode/decode when doing I/O, keep text strings and byte
> strings separate, otherwise, enjoy.

I do not think that maps clearly to Perl (or my programs either). It might
be good, simplified advice for a beginner, though I prefer never to tell
people simplified (but wrong) things. The perl unicode model is rather
simple, but leaves you in control, and I found that teaching people how
perl just allows more than 0..255 for a character index works best
(although people differ).
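
The "more than 0..255 for a character index" view can be shown in a few lines. This is a minimal sketch with made-up sample values, not taken from the presentation mentioned in the thread.

```perl
use strict;
use warnings;

# A string is a sequence of characters, each of which is just a number.
# Nothing restricts those numbers to 0..255.
my $string = join '', map { chr($_) } (65, 255, 256, 0x263A);

# ord gives the numbers back, one per character.
my @ordinals = map { ord($_) } split //, $string;

# Concatenation simply appends characters, whatever their values.
my $joined = "caf\xe9" . "\x{263a}";

printf "%d characters: %s\n", length $string, "@ordinals";
printf "joined has %d characters\n", length $joined;
```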
--
The choice of a GNU generation
Marc Lehmann
pcg goof.com
http://schmorp.de/
XX11-RIPE