Thread: perl, the data, and the tf8 flag

perl, the data, and the tf8 flag
user name
2007-03-31 06:45:20
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Moin,

On Saturday 31 March 2007 00:20:52 Marc Lehmann wrote:
> On Sat, Mar 31, 2007 at 01:39:06AM +0000, Tels <nospam-abusebloodgate.com> wrote:
> > My question was posed because I wanted to know how to *keep* a KOI8 (or
> > any other random binary) string in Perl without converting it to
> > Unicode. It seems to me this is not easily possible because there are
> > literally dozens of places where your KOI8 string might get suddenly
> > upgraded to UTF-8 (and thus get corrupted because Perl treats it as
> > ISO-8859-1). Or did I get this wrong?
>
> Yes, you did get that wrong, likely because Juerd wants users to care
> about that. But in fact, if you try it, nothing will get corrupted unless
> you use unpack "C" to get the first byte of your KOI8-string. Then you
> might get surprised (current perl) or an exception (Juerd's idea).

I should have said "random binary data" not "KOI8". "KOI8" implies the data
is some sort of text that can be "upgraded" to utf-8.

Now, you can *always* treat random binary data as, for instance, ISO-8859-1,
upgrade it to UTF-8 and then downgrade it again, since this is a lossless
transformation. But that doesn't mean it is a good idea because:

* speed - useless transcodings
* memory (utf-8 needs more memory, and the transcoding, too)
* pack/unpack or any other "peeking" at the data might leak the fact that
  Perl suddenly converted "\xfc" to "\xc3\xbc" underneath (as Marc's bug
  report showed).
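
The round trip and the leak from the list above can be sketched roughly as
follows, using the core utf8:: functions. The helper names (round_trip,
internal_octets) are made up for illustration. The hex dump is taken from an
encoded copy rather than via unpack, on the assumption that what unpack shows
for an upgraded string depends on the perl version.

```perl
use strict;
use warnings;

# Hypothetical helper: upgrade a byte string to the internal UTF-8
# representation and downgrade it again.
sub round_trip {
    my ($bytes) = @_;
    my $copy = $bytes;
    utf8::upgrade($copy);      # flag on, each byte read as ISO-8859-1
    utf8::downgrade($copy);    # flag off again; dies if a character is > 255
    return $copy;
}

# Hypothetical helper: hex dump of the octets an upgraded string is
# stored as internally.
sub internal_octets {
    my ($bytes) = @_;
    my $copy = $bytes;
    utf8::upgrade($copy);
    utf8::encode($copy);       # characters -> UTF-8 octets, flag off
    return unpack 'H*', $copy;
}

my $data = "\xfc";
print "round trip is lossless\n" if round_trip($data) eq $data;
print "internal octets: ", internal_octets($data), "\n";
```

The round trip gives back the original byte, but the upgraded copy occupies
two octets internally, which is what any code peeking at the buffer would see.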

So, yes, if Perl works perfectly in every place, converting your data always
on the fly whenever you look at it, you could stuff "KOI8" or any other
random binary data in, have it (maybe) converted to utf-8, and on
output/inspection have it converted back to the exact bytes you stuffed in.

However, as you demonstrated yourself, Perl doesn't work perfectly.

What I was trying to get at is that there are different types of data.
Before any encoding or data examination goes on you have:

** random binary data (see notes above why you do not want this treated as
ISO-8859-1 and "text"). Basically, you never want Perl to encode/decode it,
and any attempt at doing so should result in a warning/exception. (utf-8
flag off)

** ascii 7 bit data (utf-8 flag off)

** 8bit data with an encoding (assumed to be ISO-8859-1, but the user can
specify other types of encoding during a call to "decode") (utf-8 flag off)

** utf-8 data (utf-8 flag on)

As you can see, there are four different types of data, but Perl has only
one bit flag to distinguish them.

So whenever you have data without the utf-8 flag, Perl needs to decide
between the three cases mentioned above. And since it cannot store the
decision of "already seen 7bit ASCII", it needs to do this again sometime
later.

This is costly (scanning for high bit characters to distinguish between 7bit
ascii and 8bit "something else"), and it is overly simple, because Perl cannot
distinguish between "text data in ISO-8859-1 or whatever encoding is in
effect" and "binary data which shouldn't be treated as text".

As an author who inherited software that deals with random binary data (e.g.
JPEGs), this deficiency concerns me.

Unfortunately, I am in no position to do anything about it except bitch on
some random mailing list :( Where is a time machine when you need one?

[snip]

> > As you said, the current warnings::encode can't decide between the case
> > of "BINARY + UTF_8" and "ISO-8859-1 + UTF_8" as Perl makes no
> > distinction between binary data and ISO-8859-1. And this missing
> > distinction is certainly a bother
>
> Only when you hit bugs, or unpack.

<sarcasm> and you never hit bugs, or use unpack </sarcasm>

All the best,

Tels

- -- 
 Signed on Sat Mar 31 11:28:57 2007 with key 0x93B84C15.
 View my photo gallery: http://bloodgate.com/photos
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Duke Nukem Forever will come out before Unreal 2."

  -- George Broussard, 2001 (http://tinyurl.com/6m8nh)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRg5J2ncLPEOTuEwVAQJTzwf/TH9JUUnoTOq8+sRpROPhb17oWRjLmNs4
+S+vuSldaCk0qxG6LB8NvoJW8BEX7ldz+4zTaEn0/WKi3e+v9YmWFMqblqnRLm5H
lEH7FbVCY+TAINJfVj24JJNaBtZc6ptqqYNzStuVD0T2aNutv5vIVgTdKtkgdYHM
gLuG53iqN70zqwOSnn/Acq91zC56/LvEkGRZzdBwwj+qWbC7UXLJhRtc3ZuCCI9m
DblbMiKoGzorDF7dQVeguBnyohvdCEvKqMPOvs6Wp/ZVReN/DDXhlsGh7kJ3Pjl2
9C9Nmds9KuFkmvsleXZEy5KPmGIKyJVX33llQKPj9woe0g2Iyjeaeg==
=4lLh
-----END PGP SIGNATURE-----

Re: perl, the data, and the tf8 flag
user name
2007-03-31 05:03:12
Tels wrote 2007-03-31 11:45 (+0000):
> I should have said "random binary data" not "KOI8". "KOI8" implies the data
> is some sort of text that can be "upgraded" to utf-8.

Not if "upgrading" refers to the process that Perl has when it goes from
latin1 to utf-8, because this doesn't handle arbitrary encodings like
koi8r.
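
To illustrate the difference, here is a small sketch that contrasts the
implicit upgrade with an explicit decode through the core Encode module. The
single byte 0xC1 is just an example value, chosen on the assumption that
KOI8-R maps it to a Cyrillic letter.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $koi8 = "\xc1";                       # one byte of KOI8-R encoded text

# Implicit upgrade: the byte is read as ISO-8859-1, whatever it really is.
my $upgraded = $koi8;
utf8::upgrade($upgraded);

# Explicit decode: the byte is read as KOI8-R.
my $decoded = decode('koi8-r', $koi8);

printf "upgraded: U+%04X\n", ord $upgraded;
printf "decoded:  U+%04X\n", ord $decoded;
```

The upgrade keeps the code point equal to the byte value, while the decode
maps it into the Cyrillic block, so only the decode knows about koi8r.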

Your best bet is to treat koi8r encoded data as binary data. (And as
such, not mix it with text data.) If this is confusing, you may want to
gzip it first, and ungzip it afterwards ;)

> Now, you can *always* treat random binary data as, for instance, ISO-8859-1,
> upgrade it to UTF-8 and then downgrade it again, since this is a
> lossless transformation. But that doesn't mean it is a good idea

Exactly.

> * speed - useless transcodings
> * memory (utf-8 needs more memory, and the transcoding, too)
> * pack/unpack or any other "peeking" at the data might leak the fact that
> Perl suddenly converted "\xfc" to "\xc3\xbc" underneath

Good summary. Also, if you output it to an encodingless filehandle
before downgrading it again, the value may contain characters greater
than 127, and you'll get output that you probably did not intend.

> ** random binary data (see notes above why you do not want this treated as
> ISO-8859-1 and "text"). Basically, you never want Perl to encode/decode it,
> and any attempt at doing so should result in a warning/exception. (utf-8
> flag off)

Yep.

> ** ascii 7 bit data (utf-8 flag off)

The UTF8 flag can also be off for 8 bit data. For ASCII data it will
typically be off, but it wouldn't matter if it were on. (That is, if you
treat ASCII data like text. You don't want to treat UTF8 carrying data
as binary, though, because you will want to mix binary data with other
binary data, without having it upgraded.)

> As you can see, there are four different types of data, but Perl has only
> one bit flag to distinguish them.

I'd say it has two types of data, and indeed that one bit.

With the bit on, it's unicode data that internally is encoded as UTF-8.
You're not supposed to access the UTF-8 encoded octet buffer. This
string should never be used with octet operations like vec or unpack "C"
or "n".

With the bit off, it's either unicode data that internally is encoded as
ISO-8859-1, or it is binary data. This string can safely be used for
octet operations (but of course, that doesn't make sense if the string
was intended as text, with the exception of some ancient 8bit things
like crypt()).
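
As a rough sketch of keeping octet operations on flag-off strings only: the
helper below (first_word is a made-up name) insists on plain octets before it
calls unpack "n". The sample bytes are the usual start of a JPEG file and are
only there as an example.

```perl
use strict;
use warnings;

# Hypothetical helper: read the first 16-bit big-endian word of a binary
# string, refusing anything that cannot be represented as plain octets.
sub first_word {
    my ($data) = @_;
    my $octets = $data;
    # With a true second argument, downgrade returns false instead of dying.
    utf8::downgrade($octets, 1)
        or die "string contains characters > 255, this is not binary data\n";
    return unpack 'n', $octets;
}

my $jpeg_start = "\xff\xd8\xff\xe0";     # first bytes of a typical JPEG
printf "first word: 0x%04X\n", first_word($jpeg_start);
```

The design choice here is to fail loudly when text sneaks in, instead of
letting unpack look at whatever the internal buffer happens to contain.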

> So whenever you have data without the utf-8 flag, Perl needs to decide
> between the three cases mentioned above.

It doesn't do that. Every UTF8less string is treated the same.

> This is costly (scanning for high bit characters to distinguish between 7bit
> ascii and 8bit "something else")

I'm not aware of Perl scanning for high bit characters in UTF8less
strings, or any performance loss caused by that.

> As an author who inherited software that deals with random binary data (e.g.
> JPEGs), this deficiency concerns me.

I'm not aware of such a deficiency, and my Perl handles JPEG data just
fine as long as I don't let it touch unicode text data.
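
What "touching" means can be sketched like this: concatenating binary data
with a flagged text string upgrades the binary part as if it were
ISO-8859-1. The two sample values are arbitrary, and the final encode step
only stands in for writing the string out through a UTF-8 layer.

```perl
use strict;
use warnings;

my $jpeg = "\xff\xd8";          # binary data, flag off
my $text = "\x{263a}";          # unicode text, flag on

my $mixed = $jpeg . $text;      # the binary part is upgraded as ISO-8859-1

print "jpeg flag:  ", (utf8::is_utf8($jpeg)  ? "on" : "off"), "\n";
print "mixed flag: ", (utf8::is_utf8($mixed) ? "on" : "off"), "\n";

# Octets that result if the mixed string is encoded as UTF-8 for output.
my $out = $mixed;
utf8::encode($out);
print "output octets: ", unpack('H*', $out), "\n";
```

The original variable is untouched, but the octets produced from the mixed
string no longer start with the two bytes that went in.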
-- 
Kind regards,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
