Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 16:28:44
Tels wrote on 2007-03-30 22:32 (+0000):
> However, if you have 200Mbyte of ASCII string, it is more efficient to
> *not* copy the data around just to find out that, yes, all of it is 7bit

Indeed, but this is an optimization. Optimization isn't part of teaching
how things work, it always comes after.

Information overload is probably the single most problematic thing in
Perl's unicode documentation. Constantly people are told all those
internal implementation details that they don't have to know. It's no
wonder that they start assuming that they actually need this
information, and use manual setting of UTF8 flags as their first resort
in case of trouble.
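
To illustrate the point about not needing the internals, here is a minimal
sketch of the usual decode-on-input, encode-on-output pattern using the core
Encode module. The sample bytes and variable names are made up for
illustration; nothing here reads or sets the UTF8 flag.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket.
# "\xC3\xA9" is the UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "caf\xC3\xA9";

# Decode once at the input boundary: bytes become a text string.
my $text = decode('UTF-8', $bytes);

# Work with characters; no internal flag is inspected or set by hand.
my $char_count = length $text;   # counts characters, not bytes
my $upper      = uc $text;       # case mapping on text

# Encode once at the output boundary: text becomes bytes again.
my $out = encode('UTF-8', $upper);
```

With this pattern the program only ever distinguishes "bytes outside" from
"text inside", which is the level of detail the paragraph above argues for.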
-- 
Kind regards,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy  <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 19:03:41
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

On Friday 30 March 2007 21:28:44 Juerd Waalboer wrote:
> Tels wrote on 2007-03-30 22:32 (+0000):
> > However, if you have 200Mbyte of ASCII string, it is more efficient to
> > *not* copy the data around just to find out that, yes, all of it is
> > 7bit
>
> Indeed, but this is an optimization. Optimization isn't part of teaching
> how things work, it always comes after.

I almost agree. 

Some decisions really need to be made early on, in the design phase. You
cannot optimize when the design is broken. E.g. if your data needs to be
copied around *per design*, the best you can achieve is O(N). When you do
not have to copy the data, you can suddenly achieve O(1). This distinction
is quite important, and not something you can fix afterwards, apart from
redesigning (aka let's break and re-assemble it).
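
The copy-versus-no-copy distinction can be sketched in Perl as follows. Both
function names are hypothetical and only serve as illustration. Note that the
scan itself still reads every byte in both versions; what the second version
avoids is the extra O(N) copy of the argument. On Perls with copy-on-write
strings (5.20 and later, as far as I know) the first form may not physically
copy either, so treat this as a sketch of the design idea rather than a
benchmark.

```perl
use strict;
use warnings;

# Copies the argument into a new lexical before scanning it.
sub is_ascii_copy {
    my ($str) = @_;                      # may duplicate the whole string
    return $str !~ /[^\x00-\x7F]/;       # true if no character is above 0x7F
}

# Takes a reference and scans the original string in place.
sub is_ascii_ref {
    my ($str_ref) = @_;                  # only the reference is copied
    return $$str_ref !~ /[^\x00-\x7F]/;
}

my $big   = "A" x 1000;
my $mixed = "caf\xE9";

my $big_is_ascii   = is_ascii_ref(\$big);     # true
my $mixed_is_ascii = is_ascii_ref(\$mixed);   # false
```

Whether an interface hands data over by value or by reference is decided when
the interface is designed, which is why it is hard to retrofit later.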

A recent (non-Perl) example of such a methodology/design change was
zero-copy networking - I remember there being a lot of talk about this,
especially in the Unix/Linux world. Basically, when you want to send data
to the network, it is wasteful to copy it around many times just to output
it to the hardware - up to the point where the copying takes more time than
all the rest of the work to be done. However, avoiding the copy isn't that
easy.

I know it is hard to design your code so that it works fine for small data
("A") and large data ("A" x 10000000) alike, but usually these things need
to be considered early on, or you end up with a system that is only useful
for demos and toying around and breaks under real-world access.

Just like security, a performant design usually can't just be bolted on
later.

And how to design your program to be secure, fast, reliable etc. should be
taught, too. Maybe not in the same hour, but close.

Just saying... 

> Information overload is probably the single most problematic thing in
> Perl's unicode documentation. Constantly people are told all those
> internal implementation details that they don't have to know. It's no
> wonder that they start assuming that they actually need this
> information, and use manual setting of UTF8 flags as their first resort
> in case of trouble.

I think I agree. Luckily I managed to completely avoid this whole issue by
ignoring Unicode until very recently - and by then the docs and code had
improved quite a lot, so that Unicode is really usable in Perl (thank you
guys, especially Jarkko!).

All the best,

Tels


- -- 
 Signed on Fri Mar 30 23:55:12 2007 with key 0x93B84C15.
 View my photo gallery: http://bloodgate.com/photos
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Elliot, Sie Schwachkopf!"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRg2lXXcLPEOTuEwVAQKsEQf/REU2lTQdaOjP7MBeC+Uw6zdQaSB26FgY
cZn9ob0M6Jz2l2+2hukhZQpFbff09QxzVPIPmL3RtUx3SIEdF/3WjFQ7CvLxfQR8
S0KG3zkhMclrdEAspOlUrW2g+PlC9PuWGSPhUGg+LSvGVkNmQtor7dMoEVQ0BD1b
4kVRU4s7Jb4A7kyoFYksBumofNg/Qw1Y2Jr2ccn9WU3G6EHNOM4dYWDieq+BW1Ci
YcGAx+gSS523OvBh73VxYsCDz3RgY1aRWqULmvCCp38F6fluDcDAc14PQnoDz8j0
PAgkS4wiChq/uSY28wp9IZuoYU8k8+gB3eJtraRGTem+DiW7vgT/yA==
=3Z3P
-----END PGP SIGNATURE-----

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 17:25:14
On Fri, Mar 30, 2007 at 11:28:44PM +0200, Juerd Waalboer <juerdconvolution.nl> wrote:
> Information overload is probably the single most problematic thing in
> Perl's unicode documentation. Constantly people are told all those
> internal implementation details that they don't have to know.

Exactly. If they didn't have to care about those internals, it would be
much simpler, abstracted away. But that's not reality.

> wonder that they start assuming that they actually need this
> information, and use manual setting of UTF8 flags as their first resort
> in case of trouble.

If you send a compressed string over the network using JSON and decompress
it, you need to know that. Even if you do pure Perl only. And, as I do a
lot of network protocols, this never hurts me, as I know how and when perl
upgrades/downgrades and what's broken or not.
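
For readers unfamiliar with the upgrade/downgrade terminology, here is a
minimal sketch using the built-in utf8:: functions (available without
"use utf8"). The three payload bytes are made up for illustration and are not
a real compressed stream. The sketch only shows that an upgraded string still
compares equal while its internal representation differs, which is the kind
of detail that can matter when binary data is handed to byte-oriented code.

```perl
use strict;
use warnings;

# A byte string containing bytes above 0x7F (hypothetical binary payload).
my $bytes = "\x78\x9C\xE9";

# The same string after an internal upgrade.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# As Perl strings the two are equal: same characters, same length.
my $same_value  = ($bytes eq $upgraded);
my $same_length = (length($bytes) == length($upgraded));

# Only the internal representation differs.
my $flag_before = utf8::is_utf8($bytes);      # false: stored as bytes
my $flag_after  = utf8::is_utf8($upgraded);   # true: stored as UTF-8

# Before passing the string to byte-oriented code, downgrade it.
# The second argument (FAIL_OK) makes it return false instead of dying
# if the string holds characters above 0xFF.
my $ok = utf8::downgrade($upgraded, 1);
```

Code that treats the string purely as a Perl string never sees the
difference; it only shows up at boundaries that look at the underlying
buffer.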

It does hurt other people constantly, though, and I do not understand why
it has to be that way when the fix would be conceptually simple and
consistent with existing usage.

In my talk, for example, I only hinted at the implementation details and
told people to ignore them, but when they get weird bugs, they might want
to look into them.

My problem is not that there are bugs. My problem is that those bugs are
not being fixed for truly hilarious reasons, such as that obscure
reference in the Camel book, so everybody has to suffer, while other,
similar bugs have official bug status and get fixed.

I am really frustrated by that. It makes Perl as a whole rather
questionable for Unicode use, as you constantly have to think about the
internals.

And yes, that simply shouldn't be the case.

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcggoof.com
      --==---/ / _ / // / / /      http://schmorp.de/
      -=====/_/_//_/_,_/ /_/_      XX11-RIPE
