Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 16:25:46
Marvin Humphrey wrote 2007-03-30 14:00 (-0700):
> >Perl does not have strong typing.
> If it is so deadly to collide byte-oriented data with character data,
> it should not be so easy to do so accidentally.

I agree. But Perl chose to have the same single data type for all
strings, and to maintain compatibility with older Perls by assuming that
your byte string is a latin1 string if you start using it as a text
string. After all, in a strictly 8 bit world, there's no need for a
distinction, so people were never careful about it.

(Well, there was a need, but ignorance being bliss, ignoring that was
better for anyone's sanity.)
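
To make the latin1 assumption concrete, here is a minimal sketch. The
sample data and variable names are made up for illustration; only the
core Encode module is assumed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical data: the two UTF-8 bytes that encode U+00E9 (e with acute),
# as they would arrive from a file or socket.
my $bytes = "\xC3\xA9";

# A text string holding one character above 255 (U+263A).
my $text = "\x{263A}";

# Mixing them: Perl upgrades $bytes as if each byte were a latin1
# character, so the result holds U+00C3, U+00A9, U+263A.
my $mixed = $bytes . $text;

# Decoding first turns the two bytes into the single character U+00E9.
my $clean = decode('UTF-8', $bytes) . $text;

printf "mixed: %d characters\n", length $mixed;   # mixed: 3 characters
printf "clean: %d characters\n", length $clean;   # clean: 2 characters
```

Neither concatenation produces a warning or an error, which is exactly
why the mix-up is so easy to make by accident.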

It kind of bothers me that people constantly whine about this decision
years after it was made. The time to influence the decision has passed. It
just seems so counter-productive to keep bringing it up, while there are
bugs to be discovered and fixed.

I wasn't active in p5p back then, and if I had been, I would probably
not have foreseen the consequences, just like the porters then didn't.
But wonderfully, a rather consistent, usable, and useful model was
invented, with better/easier Unicode/encodings support than any other
programming language. Of course it's never good enough, but let's first
focus on finding and fixing bugs.

> That so many users, including those as expert as Marc, possess a
> "broken" understanding of Perl's Unicode model suggests a flawed
> design.

I think the design is solid, but the implementation (see regex) slightly
broken and documentation wildly misleading.

The documentation thing I'm trying to fix with perlunitut, perlunifaq,
and a lot of changes to existing documentation, all of which are now
part of bleadperl and will probably be part of the next Perl release.

In addition, I'm maintaining a concise list of best practices at
http://juerd.nl/perluniadvice, and spending tuits on teaching people
(including module maintainers) about the One Way To Do It, because there
is, in fact, just one way that really works well in this case. You just
have to find it, and stick to it. TIMTOWTDI doesn't always apply.
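
This message doesn't spell the practice out, but perlunitut describes
the general flow as: receive and decode, process, encode and output. A
minimal sketch of that flow, with made-up sample data and variable names,
assuming the input is UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: UTF-8 bytes as they might arrive from a file or socket.
my $input_bytes = "na\xC3\xAFve";

# Croak on malformed data, and leave the source string untouched.
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

# 1. Receive and decode: bytes become a text string.
my $text = decode('UTF-8', $input_bytes, $check);

# 2. Process: string functions now operate on characters, not bytes.
my $upper = uc $text;

# 3. Encode and output: text becomes bytes again.
my $output_bytes = encode('UTF-8', $upper, $check);

printf "text:   %d characters\n", length $text;          # text:   5 characters
printf "output: %d bytes\n",      length $output_bytes;  # output: 6 bytes
```

The point of the sketch is that decoding and encoding happen only at the
edges of the program, so byte strings and text strings never meet in the
middle.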

> We have been set up to fail.

Maybe so, but you haven't given up yet, and I hope you won't. Please
join us in the effort to deal with the problems at hand. It's a hell of
a lot more productive than praying for the opportunity to undo recent
years of Perl.

Surely you must know a way in which Perl's unicode support can be
improved, or accidents avoided, without trying to change all of Perl,
CPAN, and a gazillion lines of code that we can't even reach. Let's hear
it!

Thanks,
-- 
Kind regards,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 18:06:47
On Mar 30, 2007, at 2:25 PM, Juerd Waalboer wrote:
>> That so many users, including those as expert as Marc, possess a
>> "broken" understanding of Perl's Unicode model suggests a flawed
>> design.
>
> I think the design is solid, but the implementation (see regex)
> slightly broken and documentation wildly misleading.

I strongly disagree with this assessment.  In particular, I think
insisting that the user be responsible for manually segregating
character and byte-oriented data without any help from Perl is
totally unreasonable.

Look at how easily Marc made the "mistake" of commingling the two
types of data.  It's debatable whether the fact that Perl allowed him
to do that without complaint is a flaw with the design or the
implementation, but it's one or the other and it's serious.

Additionally, as Marc points out, there are lots of broken XS modules
out there -- including one of mine. (KinoSearch 0.15 -- Unicode
support is fixed as of 0.20_01, which breaks backwards
compatibility.)  Few or none of them would be broken if Perl made it
more difficult to move between character data and byte-oriented data
-- errors would be flying right and left and the broken modules would
get fixed right away.

Of course I understand why that cannot be the case, but it's
astonishing to me that you see this as a problem which can be solved
via documentation.

I hope that Perl 6 does not opt to replicate Perl 5's behavior in
this area (my understanding is that it will not, but I'm not
following development closely).  I hope that project is taking into
account the lessons we have learned in the wake of very difficult
compromises about how to balance the addition of Unicode with
preserving backwards compatibility.

> Surely you must know a way in which Perl's unicode support can be
> improved, or accidents avoided, without trying to change all of Perl,
> CPAN, and a gazillion lines of code that we can't even reach. Let's
> hear it!

How about encouraging the use of encoding::warnings in perlunitut?

How about adding it to core and having 'use 5.010;' turn it on?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
