Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 18:17:23
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Moin,

On Friday 30 March 2007 21:00:37 Marvin Humphrey wrote:
> On Mar 30, 2007, at 12:53 PM, Juerd Waalboer wrote:
> > Perl does not have strong typing.
>
> If it is so deadly to collide byte-oriented data with character data,
> it should not be so easy to do so accidentally.

It can happen every time you concatenate two strings. Maybe we could add a
new warning?

	use warnings 'upgrade';

	my $a = 'a';
	$a .= "\x{100}";		# warns
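
The proposed 'upgrade' warning category does not exist in Perl, but the
upgrade it would report can be observed with the core function
utf8::is_utf8. The following is only a minimal sketch; the variable names
are made up for illustration:

```perl
use strict;
use warnings;

my $str = "caf\xE9";                        # four characters, all below 256
my $before = utf8::is_utf8($str) ? 1 : 0;   # flag is off for this literal

$str .= "\x{100}";                          # append a character above 255

my $after = utf8::is_utf8($str) ? 1 : 0;    # Perl had to upgrade the string
my $len   = length $str;                    # still counted in characters

print "flag before: $before, flag after: $after, length: $len\n";
```

The concatenation itself is silent; only the internal representation of
$str changes.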

In an application that I am currently bringing up to speed with regard to
Unicode, I opted for a "string" struct that contains essentially:

	* the length in bytes
	* the length in characters (not always set, e.g. can be unknown)
	* the storage buffer (containing the data, plus some optional padding)
	* the encoding

Every action between two strings thus becomes very clearly defined, as you
can compare their encodings before doing anything (for instance, upgrading
one or both strings before comparing them).
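
As a rough illustration of that idea, here is a sketch in Perl using the
Encode module. The helper names (make_string, convert_to, strings_equal)
and the hash layout are assumptions made up for this example, not the
actual struct from the application described above, and padding is left
out:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical constructor: keep the raw bytes together with their encoding.
sub make_string {
    my ($bytes, $encoding) = @_;
    return {
        byte_length => length($bytes),
        char_length => undef,          # unknown until it has been computed
        buffer      => $bytes,
        encoding    => $encoding,
    };
}

# Return the string converted to $target (or unchanged if it already matches).
sub convert_to {
    my ($str, $target) = @_;
    return $str if $str->{encoding} eq $target;
    my $bytes = $str->{buffer};                    # work on a copy
    my $chars = decode($str->{encoding}, $bytes);
    my $new   = make_string(encode($target, $chars), $target);
    $new->{char_length} = length $chars;
    return $new;
}

# Compare two strings, bringing both to UTF-8 first if their encodings differ.
sub strings_equal {
    my ($x, $y) = @_;
    if ($x->{encoding} ne $y->{encoding}) {
        $x = convert_to($x, 'UTF-8');
        $y = convert_to($y, 'UTF-8');
    }
    return $x->{buffer} eq $y->{buffer} ? 1 : 0;
}

my $latin1 = make_string("caf\xE9",     'ISO-8859-1');
my $utf8   = make_string("caf\xC3\xA9", 'UTF-8');
my $equal  = strings_equal($latin1, $utf8);
print "equal: $equal\n";
```

The comparison is byte-wise after conversion, so it does not account for
Unicode normalization.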

In Perl, you have only one bit to tell you the encoding (utf8), and it seems
this is not enough, as strings without that bit set can be either ASCII, or
ISO-8859-1, or in the local locale (maybe?), or UTF-8 that hasn't yet been
tagged as UTF-8, etc. In short, it becomes a mess.
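
A small sketch of that ambiguity, assuming the Encode module: the same two
bytes, with the UTF8 flag off, decode to different text depending on which
encoding the programmer assumes. Nothing in the string itself says which
reading is right:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";                          # two bytes, UTF8 flag off

my $as_latin1 = decode('ISO-8859-1', $bytes);    # U+00C3 U+00A9
my $as_utf8   = decode('UTF-8',      $bytes);    # U+00E9

my $flag       = utf8::is_utf8($bytes) ? 1 : 0;
my $len_latin1 = length $as_latin1;
my $len_utf8   = length $as_utf8;

print "flag on raw bytes: $flag\n";
print "length read as ISO-8859-1: $len_latin1\n";
print "length read as UTF-8: $len_utf8\n";
```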

All the best,

Tels

- -- 
 Signed on Fri Mar 30 23:11:40 2007 with key 0x93B84C15.
 View my photo gallery: http://bloodgate.com/photos
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Call me Justin, Justin Case."

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRg2ag3cLPEOTuEwVAQKxjwf/Tu2blhDuAawXoTbNOCA9wBnWtvxvwL05
PoIZOI9vSivXF78ooL8/Hta8pC4o2/TgFdYzORyzNGCGNSdkkj/4vnriZ+f67uV2
BQGhzceu7r5U2Byl1xBS/egDB8FOSzB9kX3BcviD+ePjB/gAys0XagCQxfzLiFEa
mCAp3LVVANmXei0/AgoI/Mj2gO+iz4XX3QvqoL/4tr7Dg734pG/SkYvNE5DL2sc0
OfTvQPGc8NmLHseEM8Vt0jY/gApHLK0LFn9yh98BbJaGNIaCzNZxtPABGYWjFoFS
JI1qEVVO4xu0FOJktdEaOSdONTGBincL+4jZ4HbXpi7EMCCZJNLLyw==
=t2+L
-----END PGP SIGNATURE-----

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 16:44:12
Tels wrote 2007-03-30 23:17 (+0000):
> > If it is so deadly to collide byte-oriented data with character data,
> > it should not be so easy to do so accidentally.
> It can happen every time you concatenate two strings. Maybe we could add a
> new warning?

Eh, no, because Perl does not have any metadata telling you if this
non-UTF8 string is a latin1 text string, or just a random byte string.

There is no way to tell Perl how you intended your string to be used,
and there is no way for Perl to tell you the same thing about a string
it returned.

> 	use warnings 'upgrade';

This already exists on CPAN, authored by Audrey Tang, as encoding::warnings:

    use encoding::warnings;

But it will warn when Perl upgrades latin1 to utf-8, without knowing if
that is a bug or a feature, because it doesn't know if the "latin1"
string was meant as a text string or a byte string.

It's a useful debugging tool, to find unintended upgrades, but you
shouldn't try to avoid upgrading altogether. That just hurts, because
upgrading is part of how the Perl Unicode model was intended to work.

> 	* the length in bytes
> 	* the length in characters (not always set, e.g. can be unknown)
> 	* the storage buffer (containing the data, plus some optional padding)
> 	* the encoding

Hey, cool, Perl has almost the same thing, only it supports just two
encodings: latin1 and utf8. It uses a single bit to indicate the
encoding, the UTF8 flag, which can be on or off. When it's off, the
string is latin1; when it's on, the string is UTF-8.
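
To make the two internal encodings visible, here is a minimal sketch using
only core functions (utf8::upgrade, utf8::is_utf8, and bytes::length). The
variable names are made up for the example. The same text is held once in
each representation and still compares equal:

```perl
use strict;
use warnings;
use bytes ();                          # load bytes::length, pragma stays off

my $plain    = "caf\xE9";              # stored as latin1, UTF8 flag off
my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);              # same characters, stored as UTF-8

my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
my $same_string   = $plain eq $upgraded      ? 1 : 0;
my $chars         = length $upgraded;            # counted in characters
my $stored_bytes  = bytes::length($upgraded);    # size of the internal buffer

print "flags: $flag_plain / $flag_upgraded, equal: $same_string\n";
print "characters: $chars, stored bytes: $stored_bytes\n";
```

Looking at the flag or at bytes::length like this is mainly useful for
debugging; ordinary code should not need to care which representation a
string currently has.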

Maybe you should try Perl; you'll like the way it's built, because it
very closely matches your own design!

The same type of string can be used for binary data, because in the
Unicode encoding "latin1", all 256 code points map to the same byte
values.
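
This one-to-one mapping can be checked with a short sketch that uses only
core functions. The variable names are illustrative. Every possible byte
value survives an upgrade to the internal UTF-8 form and a downgrade back:

```perl
use strict;
use warnings;

my $binary = join '', map { chr } 0 .. 255;   # every possible byte value

# Each character's code point equals the byte value it was built from.
my @mismatch = grep { ord(substr($binary, $_, 1)) != $_ } 0 .. 255;
my $mismatches = scalar @mismatch;

my $copy = $binary;
utf8::upgrade($copy);                          # internal UTF-8, flag on
my $equal_upgraded = $copy eq $binary ? 1 : 0;

my $downgraded = utf8::downgrade($copy) ? 1 : 0;   # true when it succeeds
my $flag_after = utf8::is_utf8($copy)   ? 1 : 0;
my $equal_back = $copy eq $binary       ? 1 : 0;

print "mismatches: $mismatches, equal while upgraded: $equal_upgraded\n";
print "downgraded: $downgraded, flag: $flag_after, equal: $equal_back\n";
```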

> In short, it becomes a mess.

Yes, with strong typing, especially with string subtypes for arbitrary
encodings, it would be cleaner. But it would also not look like Perl 5.
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
