Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-31 13:38:25

Moin,

On Saturday 31 March 2007 16:09:18 Juerd Waalboer wrote:
> Tels skribis 2007-03-31 12:23 (+0000):
> > 	#!/usr/bin/perl -w
> > 	use Encode qw/decode/;
> > 	my $random = "\xc3\xc3";	# some random bytes
> > 	my $ascii = "a";		# some 7bit data
> >
> > 	# Somebody "helpful" decodes the ascii string:
> > 	# The encoding doesn't actually matter, since it is 7bit anyway.
> > 	# This step happens out of my control (e.g. in third party code)
> > 	$string = decode('ISO-8859-1', $ascii);
>
> $string is a text string, now. Remember, decoding is going from byte
> string to text string.

Yes, but my point was that I:

* might not be the one who "decoded" $string, or even produced it.
* do not know whether I am passed a "text" string, as there is only the
  flag-you-should-not-know-about to distinguish the two.

> Using unpack "C" on a text string makes no sense if you consider that
> this "C" doesn't stand for "character" in the sense that the
> documentation for chr, ord, length, split, etcetera uses. It stands for
> "char", which is a C datatype that contains one byte.
>
> As such, unpack "C" is a byte operation and makes sense on byte strings
> only. $string is a text string, and you can tell by looking at the
> decode() step.
>
> > 	# now take our random binary data and a 7bit ascii string and do:
> > 	print join (" ", unpack("CCC", "$random$string")), "\n";
>
> Dangerous, and that's why I suggested adding a "wide character in..."
> warning earlier in this thread.
>
> > Now explain to me why this prints different things even tho $random is
> > the same string in both cases, and $string and $ascii should be the
> > same, too.  Bonus points if you manage to not mention the uhh -- ut -
> > utf -- uhm -- er The Flag[tm].
>
> I get the bonus points! Hurrah! 

Not really, as you didn't explain the difference; you merely told me "there
is a difference" (where I personally wouldn't expect there to be one).

> The only explanation that I used is the separation between text strings
> and binary strings. It's also the only thing you need to know. You'll
> benefit from knowing more, certainly, but I see red flags in your code.

Ok, and how am I supposed to know whether, in:

	sub dosomething {
		my $a = shift;
	}

$a is a text string or a binary string? 


> > So far, I can see the ways to handle this are:
> > (..)
> > * never mix fire and water er dogs and cats er I mean text and bytes,
> > and pray that every piece of code out there adheres to this, too.
>
> Exactly.

This is not a working strategy.

> > I think the Pray and Hope[tm] strategy doesn't really work, tho.
>
> It doesn't always work, because people can't be trusted to do the right
> thing, but it can always be fixed.

Only if you consider your own code. But data is sometimes processed by other
code (Perl itself, some module etc.).

All the best,

Tels

-- 
 Signed on Sat Mar 31 18:33:51 2007 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "We're looking at a future where only the very largest companies will be
 able to implement software, and it will technically be illegal for other
 people to do so."

  -- Bruce Perens, 2004-01-23

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-31 12:33:55
Tels skribis 2007-03-31 18:38 (+0000):
> * might not be the one who "decoded" $string or produced it even.
> * do not know if I am passed a "text" string as there is only the
> flag-you-should-not-know-about to distinguish these two.
> (...)
> Ok, and how am I supposed to know whether, in:
> 	sub dosomething {my $a = shift; }
> $a is a text string or a binary string? 

No, not even the flag-you-should-not-know-about distinguishes between
the two.

When you're writing a library function to handle arbitrary data, you'll
have to pick sides, either text or binary. Fortunately, the choice is
often very simple.

When you can't choose between these two, you could write two functions:
one for text data, one for binary data. Often you can write the text
function simply by using the binary thing underneath, with a specified
UTF encoding.
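
To illustrate the two-function approach, here is a minimal sketch. The
names checksum_bytes and checksum_text are hypothetical and not taken from
any existing module; the text variant simply encodes to UTF-8 (an arbitrary
but fixed choice) and hands the result to the byte variant.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical byte-level function: expects a byte string.
sub checksum_bytes {
    my ($bytes) = @_;
    return unpack('%32C*', $bytes);    # 32-bit sum of the byte values
}

# Hypothetical text-level wrapper: expects a text string, encodes it
# with one explicitly chosen encoding, then reuses the byte function.
sub checksum_text {
    my ($text) = @_;
    return checksum_bytes(encode('UTF-8', $text));
}

print checksum_bytes("\xc3\xa9"), "\n";    # two bytes: prints 364
print checksum_text("\x{e9}"), "\n";       # one character (e-acute): prints 364
```

The caller then picks the function that matches the kind of string it
holds, rather than the function trying to guess.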

If you're just serializing data, you could opt for storing the literal
internal buffer along with the state of the UTF8 flag, or (exactly like
the previous paragraph) pick any specific encoding and stick to that.
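
The second option, picking one encoding and sticking to it, might look
roughly like this. freeze_text and thaw_text are made-up names for the
sake of the example, and UTF-8 is assumed as the storage encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: text string in, UTF-8 encoded byte string out.
sub freeze_text {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

# Hypothetical helper: UTF-8 encoded byte string in, text string out.
sub thaw_text {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

my $text   = "caf\x{e9} \x{263A}";    # 6 characters
my $stored = freeze_text($text);      # 9 bytes once encoded as UTF-8
my $back   = thaw_text($stored);

print length($stored), "\n";          # prints 9
print $back eq $text ? "round trip ok\n" : "round trip failed\n";
```

Because the stored form is always bytes in a known encoding, the reader
never has to care how the writer's string was represented internally.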

If you happen to have a function in a current API (i.e. not a contrived
one) for which you find it hard to decide, please let me know the
details. I'll help you offlist.

> Only if you consider your own code. But data is sometimes processed by other
> code (Perl itself, some module etc.).

Yes, indeed. This can be troublesome. In particular, many, many modules
still don't correctly support Unicode. I'm slowly but surely compiling a
list at http://juerd.nl/perluniadvice. Wanna help?
-- 
korajn salutojn,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <salesconvolution.nl>

Ik vertrouw stemcomputers niet.
Zie <http://www.wijvertrouwenstemcomputersniet.nl/>.
