Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 19:12:50
On Sat, Mar 31, 2007 at 12:38:19AM +0200, Juerd Waalboer
<juerdconvolution.nl> wrote:
> > codepoints map to the same byte values. Except it are different byte
> > values
>
> I said "unicode encoding", but should have said "unicode codepoints".
>
> Codepoints 0..256 in latin1 map to byte values 0..256. That makes it
> special.

Yes, and the exact same is true for unicode (both have a 1-1 mapping
between 0..255 and octets), trivially, of course, as unicode explicitly is
a superset of latin1.

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcggoof.com
      --==---/ / _ / // / / /      http://schmorp.de/
      -=====/_/_//_/_,_/ /_/_      XX11-RIPE

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 19:16:49
Marc Lehmann wrote 2007-03-31  2:12 (+0200):
> Yes, and the exact same is true for unicode (both have a 1-1 mapping
> between 0..255 and octets), trivially, of course, as unicode explicitly is
> a superset of latin1.

Unicode is a character set, not a character encoding.

While for 8 bit character sets, the encoding is the same thing, once you
get past the 8 bit boundary, the difference begins to matter.

A unicode string is a sequence of codepoints, not octets. They don't map
1:1 to octets either. To express a unicode string in octets, you need
to encode it. For this, there are several possibilities, including
UTF-8, UTF-16, ...

Unicode is a superset of the latin1 character set, not the latin1
character encoding. We'd need bigger bytes for the latter.
-- 
warm regards,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.

Re: the utf8 flag
user name
2007-03-31 07:00:55
Marc Lehmann wrote:

> unicode explicitly is a superset of latin1.


But there are defined differences:

$ perl -wle '(chr().chr(255)) =~ /^\s/ and $n++ and print for 0..255; print "[$n]"'
10
12
13
32
[5]

$ perl -wle '(chr().chr(256)) =~ /^\s/ and $n++ and print for 0..255; print "[$n]"'
10
12
13
32
133
160
[7]


$ perl -wle '(chr().chr(255)) =~ /^\w/ and $n++ and print for 0..255; print "[$n]"'
...
[63]

$ perl -wle '(chr().chr(256)) =~ /^\w/ and $n++ and print for 0..255; print "[$n]"'
...
[134]
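
The same effect can be sketched as a small script that is easier to read than the one-liners. This is an illustration under assumptions: the function name whitespace_codepoints is hypothetical, and the script deliberately avoids `use v5.12` or later and the unicode_strings feature, which would make both cases behave alike. It uses utf8::upgrade to change only the internal storage of the string, not its characters. (In the one-liners, the first match, codepoint 9, is counted but not printed, because `$n++` returns the old, false value.)

```perl
use strict;
use warnings;

# Return the codepoints in 0..255 whose one-character string matches \s.
# If $upgrade is true, the string is first switched to Perl's internal
# UTF-8 representation (same characters, different storage).
sub whitespace_codepoints {
    my ($upgrade) = @_;
    my @hits;
    for my $cp (0 .. 255) {
        my $str = chr $cp;
        utf8::upgrade($str) if $upgrade;
        push @hits, $cp if $str =~ /^\s/;
    }
    return @hits;
}

my @native   = whitespace_codepoints(0);
my @upgraded = whitespace_codepoints(1);

# Codepoints that count as whitespace only for the upgraded string.
my %seen = map { $_ => 1 } @native;
my @only_upgraded = grep { !$seen{$_} } @upgraded;

print "native:   @native\n";
print "upgraded: @upgraded\n";
print "only when upgraded: @only_upgraded\n";
```

The exact lists may vary with the perl version (newer perls also treat codepoint 11 as \s), but the difference between the two runs should be codepoints 133 and 160, matching the output above.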

-- 
Affijn, Ruud

"Gewoon is een tijger."

