Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-31 07:23:43

Moin,

On Saturday 31 March 2007 00:33:55 Juerd Waalboer wrote:
> Tels skribis 2007-03-31  1:39 (+0000):
> > My question was posed because I wanted to know how to *keep* a KOI8 (or
> > any other random binary) string in Perl without converting it to
> > Unicode. It seems to me this is not easily possible because there are
> > literally dozens of places where your KOI8 string might get suddenly
> > upgraded to UTF-8 (and thus get corrupted because Perl treats it as
> > ISO-8859-1). Or did I get this wrong?
>
> A koi8r string is a byte string. If you keep it separated from text
> strings properly, it should not be upgraded and thus treated as latin1.
> I'm very curious as to "sudden upgrades" that aren't related to mixing
> with text strings. Should you encounter them, please let me know.

"Keeping things separate" does not work in the Real World[tm], as far
as I can see. Consider:

	#!/usr/bin/perl -w
	use Encode qw/decode/;
	my $random = "\xc3\xc3";	# some random bytes
	my $ascii = "a";		# some 7bit data

	# Somebody "helpful" decodes the ascii string:
	# The encoding doesn't actually matter, since it is 7bit anyway.
	# This step happens out of my control (e.g. in third party code)
	my $string = decode('ISO-8859-1', $ascii);

	# now take our random binary data and a 7bit ascii string and do:
	print join (" ", unpack("CCC", "$random$string")), "\n";
	print join (" ", unpack("CCC", "$random$ascii")), "\n";

Now explain to me why this prints different things even tho $random is
the same string in both cases, and $string and $ascii should be the
same, too. Bonus points if you manage to not mention the uhh -- ut -
utf -- uhm -- er The Flag[tm].

So far, I can see the ways to handle this are:

* replace C with U (lots of code review work, plus it still means your
  200Mbyte TIFF file might make a trip to UTF-8 land and back)
* always forcefully downgrade stuff in 7bit ASCII (wasteful) and just
  hope your 8bit data never gets in contact with anything with
  The Flag[tm]
* never mix fire and water er dogs and cats er I mean text and bytes,
  and pray that every piece of code out there adheres to this, too.
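
A minimal sketch of the second option (forced downgrading), not taken
from the thread itself. It assumes the mixed string holds no character
above 255, and it uses utf8::upgrade() as a stand-in for third-party
code that returns a flagged string. What unpack "C" does with an
upgraded string has differed between Perl releases, so the sketch
downgrades first and only then looks at the octets:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $random = "\xc3\xc3";   # binary data, as in the example above
my $text   = "a";          # 7bit data

# Stand-in for third-party code that hands back an upgraded string;
# utf8::upgrade() changes only the internal representation.
utf8::upgrade($text);

my $mixed = "$random$text";    # concatenation upgrades the result too

# Force the byte representation back. With a true second argument,
# utf8::downgrade() returns false instead of dying when the string
# contains characters above 255.
utf8::downgrade($mixed, 1)
    or die "string contains characters above 255\n";

my $octets = join " ", unpack("C*", $mixed);
print "$octets\n";    # 195 195 97
```

The cost is the one named in the list: every such string gets copied
and converted, which is what makes this approach wasteful for large
binary data.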

I think the Pray and Hope[tm] strategy doesn't really work, tho.

All the best,

Tels

- -- 
 Signed on Sat Mar 31 12:09:53 2007 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Sundials don't work, the one I've had in my basement hasn't changed
 time since I installed it." grub (11606) on 2004-12-03 on /.


Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-31 11:09:18
Tels skribis 2007-03-31 12:23 (+0000):
> 	#!/usr/bin/perl -w
> 	use Encode qw/decode/;
> 	my $random = "\xc3\xc3";	# some random bytes
> 	my $ascii = "a";		# some 7bit data
> 
> 	# Somebody "helpful" decodes the ascii string:
> 	# The encoding doesn't actually matter, since it is 7bit anyway.
> 	# This step happens out of my control (e.g. in third party code)
> 	my $string = decode('ISO-8859-1', $ascii);

$string is a text string, now. Remember, decoding is going from byte
string to text string.

Using unpack "C" on a text string makes no sense if you consider that
this "C" doesn't stand for "character" in the sense that the
documentation for chr, ord, length, split, etcetera uses. It stands for
"char", which is a C datatype that contains one byte.

As such, unpack "C" is a byte operation and makes sense on byte strings
only. $string is a text string, and you can tell by looking at the
decode() step.
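
A small sketch of that separation, added here for illustration: if a
text string has to meet binary data, it is encoded back to bytes first,
so that unpack "C" only ever sees a byte string. The choice of
ISO-8859-1 for the encode() step is an assumption; the right encoding
is whatever the consumer of the bytes expects.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/decode encode/;

my $random = "\xc3\xc3";                   # byte string
my $string = decode('ISO-8859-1', "a");    # text string

# Turn the text back into bytes in a chosen encoding before it meets
# binary data. Both operands of the concatenation are byte strings.
my $packet = $random . encode('ISO-8859-1', $string);

my $values = join " ", unpack("C*", $packet);
print "$values\n";    # 195 195 97
```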

> 	# now take our random binary data and a 7bit ascii string and do:
> 	print join (" ", unpack("CCC", "$random$string")), "\n";

Dangerous, and that's why I suggested adding a "wide character in..."
warning earlier in this thread.
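
Until perl itself warns, a user-level guard is one possibility. The
following is only a sketch: bytes_or_die is a hypothetical helper name
made up here, not an existing function, and it catches only strings
that contain a character above 255. It does not detect text strings
whose characters all fit into one byte.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (the name is made up for this sketch): return the
# argument as a plain byte string, or die if it holds a character
# above 255 and therefore cannot be bytes.
sub bytes_or_die {
    my ($data) = @_;                 # work on a copy
    utf8::downgrade($data, 1)
        or die "wide character in byte operation\n";
    return $data;
}

my @ok = unpack("C*", bytes_or_die("\xc3\xc3" . "a"));
print "@ok\n";                       # 195 195 97

# A string with a character above 255 is rejected.
my $failed = !eval { bytes_or_die("\x{263a}"); 1 };
print "wide input rejected\n" if $failed;
```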

> Now explain to me why this prints different things even tho $random is
> the same string in both cases, and $string and $ascii should be the
> same, too. Bonus points if you manage to not mention the uhh -- ut -
> utf -- uhm -- er The Flag[tm].

I get the bonus points! Hurrah! 

The only explanation that I used is the separation between text strings
and binary strings. It's also the only thing you need to know. You'll
benefit from knowing more, certainly, but I see red flags in your code.

> So far, I can see the ways to handle this are:
> (..)
> * never mix fire and water er dogs and cats er I mean text and bytes,
>   and pray that every piece of code out there adheres to this, too.

Exactly.

> I think the Pray and Hope[tm] strategy doesn't really work, tho.

It doesn't always work, because people can't be trusted to do the right
thing, but it can always be fixed.
-- 
korajn salutojn,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
