Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




user name
2007-03-31 11:27:11
Ben Carter wrote 2007-03-31  4:08 (-0600):
> Unicode does not even HAVE characters, it has codepoints.

Very good point, but Perl's documentation refers to codepoints as
"characters", and does that rather consistently.

I'm considering sweeping through the docs and changing it all, but it
would be a lot of work and a huge patch. I wonder if it's worth that.

> Now consider the case of
>   $y = chr(1000);
> Clearly whatever is in $y cannot be a single octet.  The way Perl
> currently works is that now $y is considered to be a string of Unicode
> codepoints.

Yes.

But to go into a bit more detail for the more interesting case of
chr(233): this is either a byte string with only one byte, or a text
string with only one cha^Wcodepoint. Perl doesn't know, or care, so the
programmer has to.
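
To make the two readings concrete, here is a small sketch. The variable
names are made up for the example, and it assumes the core Encode module
is available. The same one-element string comes out as different octets
depending on what the programmer decides it means:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = chr(233);    # one element with value 233; Perl attaches no meaning

# Reading 1: a byte string. The single octet is 0xE9.
my $octet_hex = unpack("H*", $s);

# Reading 2: a text string holding U+00E9. The octets now depend on
# the encoding the programmer picks on output.
my $utf8_hex   = unpack("H*", encode("UTF-8", $s));
my $latin1_hex = unpack("H*", encode("ISO-8859-1", $s));

# chr(1000) only has the text reading: U+03E8 does not fit in one octet.
my $y         = chr(1000);
my $y_utf8hex = unpack("H*", encode("UTF-8", $y));

print "length:  ", length($s), "\n";    # 1
print "as byte: $octet_hex\n";          # e9
print "UTF-8:   $utf8_hex\n";           # c3a9
print "Latin-1: $latin1_hex\n";         # e9
print "chr(1000) as UTF-8: $y_utf8hex\n";    # cfa8
```

Nothing in the string itself says which reading is intended; the choice
only shows up where the programmer encodes (or does not encode) it.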

> So $y contains a single codepoint, U+03E8.  The internal flag is used
> to indicate that the internal data pointer points to something that is
> a "Unicode codepoint string".

No, see Abigail's response for clarification.

>   print unpack("H*", pack("C", 1000));

Feeding 1000 to C has undefined behaviour: the C type can only handle
values 0..255, and there's no documentation defining what happens if you
feed it something <0 or >255. A similar thing occurs with floating point
numbers, like 64.5. The current implementation truncates that to 64,
without warning.
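
A sketch of the truncation, plus one way to guard against out-of-range
input. The helper pack_octet is hypothetical (it is not part of Perl);
it simply refuses anything that is not an integer in 0..255 instead of
relying on whatever pack happens to do:

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper, not a Perl builtin: only accept values that the
# C template is specified to handle.
sub pack_octet {
    my ($n) = @_;
    croak "not an integer in 0..255: $n"
        unless $n =~ /\A[0-9]+\z/ && $n <= 255;
    return pack("C", $n);
}

# The fractional part is dropped by the current implementation.
my $truncated = unpack("H*", pack("C", 64.5));

my $ok       = unpack("H*", pack_octet(233));
my $rejected = !eval { pack_octet(1000); 1 };    # true if the helper died

print "64.5 packs as: $truncated\n";    # 40 (hex for 64)
print "233 packs as:  $ok\n";           # e9
print "1000 rejected: ", ($rejected ? "yes" : "no"), "\n";    # yes
```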

> If you expect values over 255, then you should not use "C".

Indeed!

> Of course if you have values over 255 you have to use "U" in unpack,
> that only makes sense!

If these values are codepoints, yes. But if they're just numbers, other
unpack templates, such as N or V, are probably better.
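
For comparison, a short sketch of what each template does with the same
number (variable names are illustrative only). N and V store 1000 as a
fixed-width 32-bit integer, big-endian and little-endian respectively,
while U produces a one-codepoint string:

```perl
use strict;
use warnings;

my $n = 1000;

my $big    = pack("N", $n);    # 32-bit unsigned, big-endian
my $little = pack("V", $n);    # 32-bit unsigned, little-endian
my $char   = pack("U", $n);    # a string holding the codepoint U+03E8

print unpack("H*", $big), "\n";       # 000003e8
print unpack("H*", $little), "\n";    # e8030000
print unpack("N", $big), "\n";        # 1000
print unpack("V", $little), "\n";     # 1000
printf "%d U+%04X\n", length($char), ord($char);    # 1 U+03E8
```

With N or V the result is always four octets, so the reader knows how
much to consume; with U the result is text, and its octet form depends
on how it is encoded later.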

> [1] I am deliberately ignoring the box in the corner labeled "EBCDIC".

Oh, so am I. In fact, I've probably never even seen such a box in my
short life so far.
-- 
Kind regards,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy  <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
