Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 13:46:41
On Fri, Mar 30, 2007 at 08:00:36PM +0200, Marc Lehmann wrote:
> On Fri, Mar 30, 2007 at 01:31:22PM +0100, Nicholas Clark <nickccl4.org> wrote:

> However, some of the obvious fixes would be to change ExtUtils/typemap so
> that stuff such as "const char *" no longer boils down to random bytes.
> Example:
> 
> SV *compress (const char *data);
> 
> The right thing here is to use SvPVbyte, at least in the majority of
> cases.  The reason is that existing users either have to call downgrade
> explicitly themselves or suffer from random problems.

This seems a sane idea. However, I'm not going to change it for 5.8.9.

5.10 is a different matter, but also not my call.

> Could you tell me why almost every other 5.6 bug was fixed in 5.8, but
> gratuitous breakage of large parts of CPAN is accepted with this change?
> What's the rationale behind keeping this 5.6 bug, while fixing the rest?

No, I can't.
5.8.0 and 5.8.1 were not my releases, *and* I wasn't aware that 'C' was a
problem at that time.

I *think* that the reason may have been because "it is documented in
Programming Perl" that it behaves the 5.6.0 way.

*but*

I went looking, and the closest I can find to an assertion about how it works
is:

* the pack/unpack letters "c" and "C" do /not/ change, since they're often
  used for byte-orientated formats. (Again, think "char" in the C language.)
  However, there is a new "U" specifier that will convert between UTF-8
  characters and integers:

    pack("U*", 1, 20, 300, 4000) eq v1.20.300.4000

* The chr and ord functions work on characters

    chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000

  In other words, chr and ord are like pack("U") and unpack("U"), not like
  pack("C") and unpack("C"). In fact, the latter two are how you now emulate
  byte-orientated chr and ord if you're too lazy to use bytes.

[3rd edition, page 408]

> > I don't like anything in Perl space that lets the abstraction leak, and "C" is
> > one of them.
> 
> So why not fix it? Nobody made such a fuss when they fixed the remaining bugs
> from 5.6. For example, PApp, one of my older modules using unicode, is full

I'm not going to change anything this late in 5.8.x.
Whether 5.10 changes is not something I have the final say on.


> And as I said, there is no pack-type that gives me the old meaning of
> "C" that every structure-decoding program relies on. That's gratuitous
> undocumented breakage. (It really is undocumented because all of the perl
> documentation tells me that the internal encoding doesn't surface, and the
> small hint in the pack description for "C" seems to reinforce this as it
> tells me it works "even in the presence of Unicode"!).
> 
> In any case, please could you answer to me why you accept obvious breakage
> of old code in this case? I really wanna know.

> 
> The only argument in favour I have heard so far is that the camelbook
> documents it in some obscure way. But that cannot be a reason to keep a
> bug.  If the camelbook describes buggy behaviour, it needs a fix. It is
> insane to force every existing perl program that uses that feature to
> be changed in a way that contradicts the rest of the documentation, is
> unintuitive and generally useless (again, show me a useful application for
> unpack "C" with 5.8 semantics).

I agree with the obscure now.

Reading the wording of the Camel book carefully, this behaviour

$ perl5.00503 -le 'print unpack "c", chr (256+78)'
78
$ perl5.00503 -le 'print unpack "C", chr (256+78)'
78


"unchanged" actually means to me that it would produce the same output.

The only thing that seems to define the current 5.6 behaviour is the
comparison of unpack("C") with ord under use bytes in the paragraph on chr
and ord.

Nicholas Clark

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 17:41:18
On Fri, Mar 30, 2007 at 07:46:41PM +0100, Nicholas Clark <nickccl4.org> wrote:
> This seems a sane idea. However, I'm not going to change it for 5.8.9

Sure.

> 5.10 is a different matter, but also not my call.

Sure.

I know all that...

> > Could you tell me why almost every other 5.6 bug was fixed in 5.8, but
> > gratuitous breakage of large parts of CPAN is accepted with this change?
> > What's the rationale behind keeping this 5.6 bug, while fixing the rest?
> 
> No, I can't.
> 5.8.0 and 5.8.1 were not my releases, *and* I wasn't aware that 'C' was a
> problem at that time.

Yes, you can. You control 5.8, and you said it's not gonna happen. So either
you have a reason and can tell me of it, or not.

The reason I wanna know is because I want to know what to tell
people. Either it is "your code is broken, unpack "C" without downgrade
is a bug in your code" or "it is a bug in perl, you can work around by
enabling ->shrink for the time being".

> I *think* that the reason may have been because "it is documented in
> Programming Perl" that it behaves the 5.6.0 way.

I would argue it doesn't behave the 5.6 way, though: 5.6 had a completely
broken unicode implementation, and lots of bugs. In 5.6 it would give me one
"character", because 5.6 often exposed the utf-8 encoding explicitly, so one
character in the 5.6 model often was a single internal byte.

Also, I still think it is a mistake to break working code without giving
an alternative(!) for unpack that isn't "you have to downgrade and keep
your fingers crossed".

> I went looking, and the closest I can find to an assertion about how it works
> is:
> 
> * the pack/unpack letters "c" and "C" do /not/ change, since they're often
>   used for byte-orientated formats. (Again, think "char" in the C language.)
>   However, there is a new "U" specifier that will convert between UTF-8
>   characters and integers:
> 
>     pack("U*", 1, 20, 300, 4000) eq v1.20.300.4000

Exactly. But "C" somehow works on UTF-8, while it shouldn't. It should
work on characters, as documented (just like in C, char array[]; array[i]
is one character, regardless of how many bits a character in C has, or how
it is encoded).

> * The chr and ord functions work on characters
> 
>     chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000
> 
>   In other words, chr and ord are like pack("U") and unpack("U"), not like
>   pack("C") and unpack("C"). In fact, the latter two are how you now emulate
>   byte-orientated chr and ord if you're too lazy to use bytes.

So due to that documentation insanity it is now suggested that all code that
used "C" before must use "U" now to get the same effect as in earlier perl
versions?

Then why was "use feature" introduced in the first place? Just document
existing programs to be broken. I am quite convinced (whatever that means
to you) that that would result in less, and less silent, breakage than
renaming "C" to "U".

Besides, perl 5.8 does not follow that description:

   perl -e '$x = "\xc3\xbc"; die unpack "U*", $x'

This gives me 195188, two characters, although it is a single UTF-8
character, so why does it wrongly give me two? $x certainly is utf-8-encoded
(try Encode::encode_utf8 chr 252, it results in the above string).

Whoever wrote that part, simply said, was completely confused about unicode.
That's fine, Sarathy had to hammer it into me too, and then made a mistake
himself after he did so. And it took me years to understand how it should be.
It is hard to do from an implementor's standpoint because you are so near the
actual code.

But that doesn't mean it is right. Fact is, the above documentation is
simply wrong, both with regard to how it should be and with regard to how
it is implemented.

> [3rd edition, page 408]

(Thanks for digging it out, btw, I haven't seen that yet).

> > So why not fix it? Nobody made such a fuss when they fixed the remaining bugs
> > from 5.6. For example, PApp, one of my older modules using unicode, is full
> 
> I'm not going to change anything this late in 5.8.x.
> Whether 5.10 changes is not something I have the final say on.

Ok, so I will tell people to replace "C" by "U" in their code then.

Thanks! (And go on with your good work, btw., it seems that wasn't quite
clear to some people, so again: you are doing tremendously good work!)

> "unchanged" actually means to me that it would produce the same output.
> 
> The only thing that seems to define the current 5.6 behaviour is the
> comparison of unpack("C") with ord under use bytes in the paragraph on chr
> and ord.

Right, while the documentation on unpack "U" disagrees with it, as it talks
about UTF-8. The documentation clearly does not apply to current perls, it
clearly applies to the 5.005_5x model where perl had no UTF-X flag.

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcggoof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
