Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 19:29:42
On Sat, Mar 31, 2007 at 02:16:49AM +0200, Juerd Waalboer
<juerdconvolution.nl> wrote:
> Marc Lehmann wrote 2007-03-31  2:12 (+0200):
> > Yes, and the exact same is true for unicode (both have a 1-1 mapping
> > between 0..255 and octets), trivially, of course, as unicode explicitly
> > is a superset of latin1.
> 
> Unicode is a character set, not a character encoding.

As is latin1.

> A unicode string is a sequence of codepoints, not octets.

Nope. You can encode unicode codepoints into UTF-8 and still end up with a
unicode string. Encoding doesn't change the fact that it is unicode that
you are storing.

Since it seems hard to grasp, here is an example:

   use Encode;
   my $s = "Hello, World!";
   $s = Encode::encode_utf8($s);

$s contains the famous greeting before and after the encoding. It is still
an ASCII string, an iso-8859-15 string, a unicode string, and a text
string. Whether it is encoded or not does not change the fact that the
string contains the message "Hello, World!".

If you drop ASCII, the same is true for "Hallöchen!", which looks
different in UTF-8 than in an unencoded string, but it is still the same
message. And it is still using unicode to represent the characters.

The fact that you encode something does not change the something that you
encode. Making an arbitrary distinction only confuses the issue.
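
A minimal sketch of the two cases above, assuming a Perl with the core
Encode module. The variable names are illustrative and not from the
original message, and the o-umlaut is written as a \x{f6} escape so that
the result does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode ();

# Pure ASCII: encoding to UTF-8 yields the very same octets.
my $ascii         = "Hello, World!";
my $ascii_encoded = Encode::encode_utf8($ascii);
my $ascii_same    = $ascii eq $ascii_encoded ? 1 : 0;

# Non-ASCII: "Hallöchen!" with the o-umlaut written as U+00F6.
my $text   = "Hall\x{f6}chen!";
my $octets = Encode::encode_utf8($text);

# The encoded form is one octet longer, because U+00F6 becomes 0xC3 0xB6 ...
my $char_len  = length $text;
my $octet_len = length $octets;

# ... but decoding recovers exactly the same message.
my $round_trip = Encode::decode_utf8($octets) eq $text ? 1 : 0;

print "ascii unchanged: $ascii_same\n";
print "characters: $char_len, octets: $octet_len\n";
print "round trip ok: $round_trip\n";
```

The encoded and unencoded forms differ as strings in the second case, which
is the part of the argument the rest of the thread disputes; the sketch only
shows that the message survives the round trip.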

> They don't map 1:1 to octets either. To express a unicode string
> in octets, you need to encode it. For this, there are several
> possibilities, including UTF-8, UTF-16, ...

Sure. Octets are just things that store numbers between 0 and 255. The
most compact way to do that in Perl is using a string. That's also the most
natural way to represent bytes in Perl, closely followed by integers for
single bytes.

You do not store octets in latin1, or unicode, or whatever else in that
string. You are just using the most natural way to represent octets. And that
just happens to work, because Perl was designed to work that way.

The mapping between perl bytes and octets is 1:1. ord and chr do it for
you, for example, and unpack "n" does it for you in case you encode/decode
two-byte entities. unpack "C", however, does not map to octets in
perl. That's the bug.
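
A small sketch of the ord/chr and unpack "n" mapping described above; the
variable names are illustrative. It does not try to reproduce the
unpack "C" behaviour complained about, which depends on the Perl version
and on how the string is stored internally.

```perl
use strict;
use warnings;

# chr and ord map every value 0..255 to a one-character string and back.
my $mismatches = grep { ord(chr($_)) != $_ } 0 .. 255;

# A string built from octet values has one character per octet.
my $bytes = join '', map { chr($_) } 0x01, 0x02, 0xFF;
my $len   = length $bytes;

# unpack "n" reads a 16-bit big-endian unsigned integer from two octets.
my ($word) = unpack "n", "\x01\x02";

# pack "n" goes the other way and produces the same two octets.
my $packed    = pack "n", $word;
my $packed_ok = $packed eq "\x01\x02" ? 1 : 0;

print "mismatches: $mismatches\n";
print "length: $len\n";
print "word: $word\n";
print "pack round trip ok: $packed_ok\n";
```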

> Unicode is a superset of the latin1 character set, not the latin1
> character encoding. We'd need bigger bytes for the latter

Right. And Perl has those bigger bytes.

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcggoof.com
      --==---/ / _ / // / / /      http://schmorp.de/
      -=====/_/_//_/_,_/ /_/_      XX11-RIPE

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 20:08:13
Marc Lehmann wrote 2007-03-31  2:29 (+0200):
> > Unicode is a character set, not a character encoding.
> As is latin1.

For all intents and purposes, latin1 is a character encoding as well as
a character set. If not officially, then certainly for Perl. It can be
used with the :encoding layer, with Encode::decode, et cetera. "Unicode"
cannot.
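
A minimal sketch of latin1 used as an encoding in both of those ways,
assuming the core Encode module and PerlIO in-memory filehandles. The
variable names are illustrative, and "iso-8859-1" is used as the encoding
name for latin1.

```perl
use strict;
use warnings;
use Encode ();

# One latin1 octet, 0xF6, decodes to the single character U+00F6.
my $octet     = "\xF6";
my $char      = Encode::decode('iso-8859-1', $octet);
my $codepoint = ord $char;

# The same character encodes to different octets in different encodings.
my $as_latin1 = Encode::encode('iso-8859-1', $char);
my $as_utf8   = Encode::encode('UTF-8', $char);

# The :encoding layer performs the same conversion on output.
my $buffer = '';
open my $fh, '>:encoding(iso-8859-1)', \$buffer or die "open: $!";
print {$fh} $char;
close $fh or die "close: $!";

printf "codepoint: U+%04X\n", $codepoint;
printf "latin1 octets: %d, UTF-8 octets: %d, written: %d\n",
    length $as_latin1, length $as_utf8, length $buffer;
```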

I don't know where your terminology comes from, but I try to stick to
whatever is common in Perl land. Sorry if that differs from other
communities.

> > Unicode is a superset of the latin1 character set, not the latin1
> > character encoding. We'd need bigger bytes for the latter
> Right. And Perl has those bigger bytes.

A byte, in Perl jargon at least, is an octet. An octet can hold any
single value in the range 0..255, and is exactly 8 bits in size. Every
byte is exactly as large as any other byte.
-- 
Kind regards,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-31 07:26:53
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Moin,

On Saturday 31 March 2007 00:29:42 Marc Lehmann wrote:
> On Sat, Mar 31, 2007 at 02:16:49AM +0200, Juerd Waalboer
> <juerdconvolution.nl> wrote:
> > Marc Lehmann wrote 2007-03-31  2:12 (+0200):
> > > Yes, and the exact same is true for unicode (both have a 1-1 mapping
> > > between 0..255 and octets), trivially, of course, as unicode
> > > explicitly is a superset of latin1.
> >
> > Unicode is a character set, not a character encoding.
>
> As is latin1.
>
> > A unicode string is a sequence of codepoints, not octets.
>
> Nope. You can encode unicode codepoints into UTF-8 and still end up with
> a unicode string. Encoding doesn't change the fact that it is unicode
> that you are storing.
>
> Since it seems hard to grasp, here is an example:
>
>    my $s = "Hello, World!";
>    $s = Encode::encode_utf8 $s;
>
> $s contains the famous greeting before and after the encoding. It is
> still an ASCII string, an iso-8859-15 string, a unicode string, and a
> text string. Whether it is encoded or not does not change the fact that
> the string contains the message "Hello, World!".
>
> If you drop ASCII, the same is true for "Hallöchen!", which looks
> different in UTF-8 than in an unencoded string, but it is still the
> same message. And it is still using unicode to represent the characters.
>
> The fact that you encode something does not change the something that you
> encode. Making an arbitrary distinction only confuses the issue.

Especially since Perl itself doesn't have any way to distinguish "a"
(UNKNOWN ENCODING) from "a" (ASCII) from "a" (ISO-8859-1) from "a"
(UTF-8), except for one bit.
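
A short sketch of that one bit, assuming it refers to the internal UTF8
flag that this thread is about. utf8::upgrade and utf8::is_utf8 are
built-in functions; the variable names are illustrative.

```perl
use strict;
use warnings;

my $plain   = "a";
my $flagged = "a";

# Switch the internal storage of one copy to UTF-8, which sets the flag.
utf8::upgrade($flagged);

my $plain_flag   = utf8::is_utf8($plain)   ? 1 : 0;
my $flagged_flag = utf8::is_utf8($flagged) ? 1 : 0;

# At the Perl level the two strings still compare as equal.
my $equal = $plain eq $flagged ? 1 : 0;

print "plain flag: $plain_flag, upgraded flag: $flagged_flag\n";
print "equal: $equal\n";
```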

All the best,

Tels

- -- 
 Signed on Sat Mar 31 12:24:31 2007 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Most people, I think, don't even know what a rootkit is, so why should
 they care about it?"

  -- Thomas Hesse, President of Sony BMG's global digital business
     division, 2005.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEUAwUBRg5TjXcLPEOTuEwVAQIrGAf417/05df4c3hIzTnFoidS3fAKWPHm9Ots
5BNa8n3PJci4cGQ2Sz7LzRf4BjD6+seW8Zq6fKNMIlCpmwCJYh/M+Ol8BBGefjhU
tJxebJs1O2K+ZEd9cJTP/PP2bnqg9Z1CwiBNn8xT/cT8tbF6rR9kujaHooSkHnPV
snDog7uLrk117tof8ORcybml0bDfhWzh4UfYOyue37RyrqAWnIXNOu24uYUjMiDT
US3vym0LX+LUO4aBS9Ur/tX6FSBX/5mXDn0fPR016ESbzWA6TMMurSIjWYLFTw9R
rRK0KSAb/z93Z6ZhHvyaKOz8Tt9ma44adu6WgTXrK5dcrpih8xbX
=Q94f
-----END PGP SIGNATURE-----
