Thread: Re: perl, the data, and the tf8 flag

Re: perl, the data, and the tf8 flag
2007-03-31 07:40:31
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi,

On Saturday 31 March 2007 10:03:12 Juerd Waalboer wrote:
> Tels wrote 2007-03-31 11:45 (+0000):
> > I should have said "random binary data" not "KOI8". "KOI8" implies the
> > data is some sort of text that can be "upgraded" to utf-8.
>
> Not if "upgrading" refers to the process that Perl has when it goes from
> latin1 to utf-8, because this doesn't handle arbitrary encodings like
> koi8r.
>
> Your best bet is to treat koi8r encoded data as binary data. (And as
> such, not mix it with text data.)

The "do not mix it" part is where I am currently having problems.

As far as I can see, there is nothing in Perl that prevents this from
happening, nor can I enable a warning when it happens. All you get at
some point is corrupted data, or very inefficient code (since Perl internally
uses UTF-8 while it could use just the raw bytes).

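A minimal sketch of the silent upgrade described above, assuming a
reasonably recent Perl 5. The sample data and variable names are made up
for illustration; utf8::is_utf8 and bytes::length are used here only to
peek at the internal representation, not as something a normal program
should rely on.

```perl
use strict;
use warnings;
use bytes ();   # load the module only, so bytes::length() can be called explicitly

# Hypothetical data: four raw octets standing in for binary data,
# and a text string containing a character above 255.
my $bytes = "\xfc\x00\xff\x01";
my $text  = "snowman: \x{2603}";

printf "bytes stored as UTF-8 internally: %s\n",
    utf8::is_utf8($bytes) ? "yes" : "no";

# Concatenation mixes the two. No warning is emitted, and the result is
# stored as UTF-8 internally because $text requires it.
my $mixed = $bytes . $text;
printf "mixed stored as UTF-8 internally: %s\n",
    utf8::is_utf8($mixed) ? "yes" : "no";

# The first character still has code point 0xFC, but it now occupies
# two octets in the internal buffer.
my $first = substr($mixed, 0, 1);
printf "code point of first character: 0x%02x\n", ord($first);
printf "internal octets used for it:   %d\n", bytes::length($first);
```

Nothing in the concatenation itself signals that binary data and text
were combined; the change is only visible when inspecting the internals.
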
> If this is confusing, you may want to
> gzip it first, and ungzip it afterwards ;)

It is not confusing to me, but gzip wouldn't actually help when Perl
helpfully upgrades the gzipped data to utf-8.

> > Now, you can *always* treat random binary data as, f.i., ISO-8859-1,
> > upgrade it to UTF-8 and then downgrade it again, since this is a
> > lossless transformation. But that doesn't mean it is a good idea
>
> Exactly.
>
> > * speed - useless transcodings
> > * memory (utf-8 needs more memory, and the transcoding, too)
> > * pack/unpack or any other "peeking" at the data might leak the fact
> >   that Perl suddenly converted "\xfc" to "\xc3\xbc" underneath
>
> Good summary. Also, if you output it to an encodingless filehandle
> before downgrading it again, the value may contain characters greater
> than 127, and you'll get output that you probably did not intend.
>
> > ** random binary data (see notes above why you do not want this treated
> >    as ISO-8859-1 and "text"). Basically, you never want Perl to
> >    encode/decode it, and any attempt to do so should result in a
> >    warning/exception. (utf-8 flag off)
>
> Yep.
>
> > ** ascii 7 bit data (utf-8 flag off)
>
> The UTF8 flag can also be off for 8 bit data. For ASCII data it will
> typically be off, but it wouldn't matter if it were on. (That is, if you
> treat ASCII data like text. You don't want to treat UTF8 carrying data
> as binary, though, because you will want to mix binary data with other
> binary data, without having it upgraded.)
>
> > As you can see, there are four different types of data, but Perl has
> > only one bit flag to distinguish them.
>
> I'd say it has two types of data, and indeed that one bit.
>
> With the bit on, it's unicode data that internally is encoded as UTF-8.
> You're not supposed to access the UTF-8 encoded octet buffer. This
> string should never be used with octet operations like vec or unpack "C"
> or "n".

I know what you mean, but the problem is that you are also proposing that
the UTF-8 flag should be hidden from the user. So, how can I "not access
the UTF-8 encoded" buffer when I don't know if the buffer I access is UTF-8
or not?

I think this is also the problem Marc is having with your POV. You can't
hide the internal encoding from the user and then tell him "do not mix
these two different things even though you do not know which one is which".
That's a bit, er, unrealistic.

> With the bit off, it's either unicode data that internally is encoded as
> ISO-8859-1, or it is binary data. This string can safely be used for
> octet operations (but of course, that doesn't make sense if the string
> was intended as text, with the exception of some ancient 8bit things
> like crypt()).
>
> > So whenever you have data without the utf-8 flag, Perl needs to decide
> > between the three cases mentioned above.
>
> It doesn't do that. Every UTF8less string is treated the same.

And that is inefficient.

> > This is costly (scanning for high bit characters to distinguish between
> > 7bit ascii and 8bit "something else")
>
> I'm not aware of Perl scanning for high bit characters in UTF8less
> strings, or any performance loss caused by that.

  use Benchmark;
  use Encode qw/decode/;

  my $a = 'a' x 100_000_000;       # 7bit utf-8 off
  my $b = 'b' x 100_000_000;       # 7bit utf-8 off
  my $c = 'c' x 100_000_000;       # 7bit utf-8 flag on
  $c = decode('ISO-8859-1', $c);

  timethese (-3, {
    'a eq b' => sub { $a eq $b; },
    'a eq c' => sub { $a eq $c; },
  } );

Benchmark: running a eq b, a eq c for at least 3 CPU seconds...
  a eq b: 4s (4.72 usr + -0.02 sys = 4.70 CPU) 7218655.96/s (n=33927683)
  a eq c: 3s (2.80 usr + 0.46 sys = 3.26 CPU) 2.76/s (n=9)

I rest my case.

All the best,

Tels

- --
Signed on Sat Mar 31 12:28:15 2007 with key 0x93B84C15.
View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.

"Blogebrity: Wow, guess what this one stands for? Too
easy. Hey, anyone
can do it: take a blogger who's a chef, and you get: BLEF.
A blogger
who's a dentist? BENTIST. A female blogger with an itch?
You guessed it:
a BITCH."
-- maddox from xmission
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRg5Wv3cLPEOTuEwVAQKVMQf9G1RLUfo+fY+H8dn4Qa+ggbL/IRnOz3wi
sR4KAw32xrCvHPZYkQRPm1xVJiDwpMDgEgdVSEo6Ot9qA3TLXGadF4F9PMzPQRWM
4509df7yoEulvKsKNiqHFJSbxO8KlVaX4CO8Zr/8aCnM4IIajBuISRQUtLARRl/d
VQacgTOJwHCkaRqB8T+9kdP3U9OV72xXoYDHRXRbJOiav7QVGmmVib5M2ZQWj5zv
H8r1daSG7mFg3qCUE/KKYLAC2hmMMvC31zhMzWveAxlFE5hWg+EyYFzxbPk9sisT
69seb4XaXXrpM/jn7C3Gq2GKeEggeRDrAhw3DvlPrO0r1VZYvmFDwQ==
=YMp0
-----END PGP SIGNATURE-----

Re: perl, the data, and the tf8 flag
2007-03-31 11:04:13
Tels wrote 2007-03-31 12:40 (+0000):
> The "do not mix it" part is where I am currently having problems.
> As far as I can see, there is nothing in Perl that prevents this from
> happening, nor can I enable a warning when it happens.

This is true, but no different from other things that you should keep
track of yourself. Some operations can change the type of a variable,
not just inside, but also conceptually.

* references
  $ref++, and it's no longer a ref.

* strings
  $string++, and it's no longer a string.

* numbers
  "x" on a number very rarely makes sense.

Though this is all visible in your code, because there are different
operators, and they are known to force their type upon the values
(simplified explanation).

Text strings and byte strings share a single type, but also a single set
of operators. Indeed, that makes it harder to cope with keeping them
apart.

Some people may like a hungarian notation for it.

> All you get is at some point corrupted data, or very inefficient code
> (since Perl internally uses UTF-8 while it could use just the raw
> bytes).

If you accidentally mix them, yes. But if you don't, the byte string
won't be upgraded to utf8 (when it is, that is probably a bug that
should be fixed), and your bytestring just lives on exactly like it
would have in Perl 5.005, or 4, or perhaps 1 even.

> It is not confusing to me, but gzip wouldn't actually help when Perl
> helpfully upgrades the gzipped data to utf-8

Perl is helpful when it sees you're using the string as a text string.
It then assumes that it had been latin1 all the time.

It would be useful to have magic on a string that enforced
non-upgrading, but only for strings that you want it on.

This would be the bondage part, for when discipline was broken.

> I know what you mean, but the problem is that you are also proposing that
> the UTF-8 flag should be hidden from the user. So, how can I "not access
> the UTF-8 encoded" buffer when I don't know if the buffer I access is UTF-8
> or not?

Accessing the buffer directly is something that byte operators do, e.g.
vec and unpack("C"). If you never mix your byte strings with text
strings, and use these operators only with byte strings, you can be sure
that the variables won't be UTF8 internally.

Note that if you refactor this guideline, the "UTF8" part disappears.

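For code that cannot be sure the discipline was kept, one possible
safety net is to normalise a string before handing it to a byte
operator. The following is only a sketch: the helper name as_octets is
hypothetical, and it relies on utf8::downgrade with a true second
argument returning false, rather than dying, when the string holds
characters above 255.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the argument that is stored one
# octet per character, or die if it holds characters above 255 (in which
# case it was text, not a byte string).
sub as_octets {
    my ($s) = @_;                 # work on a copy, leave the caller's value alone
    utf8::downgrade($s, 1)
        or die "string contains wide characters; not a byte string\n";
    return $s;
}

# Two raw octets, 0xFC 0xFF, produced by a byte operator.
my $packed = pack("n", 0xFCFF);

# Simulate an accidental upgrade somewhere else in the program.
my $copy = $packed;
utf8::upgrade($copy);

# The byte operator sees the original octets again.
my $value = unpack("n", as_octets($copy));
printf "unpacked value: 0x%04X\n", $value;
```

The check costs a copy per call, so it is better suited to boundaries
(I/O, sockets, digest functions) than to inner loops.
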
> > > This is costly (scanning for high bit characters to distinguish between
> > > 7bit ascii and 8bit "something else")
> > I'm not aware of Perl scanning for high bit characters in UTF8less
> > strings, or any performance loss caused by that.
> use Benchmark;
> use Encode qw/decode/;
> my $a = 'a' x 100_000_000; # 7bit utf-8 off
> my $b = 'b' x 100_000_000; # 7bit utf-8 off
> my $c = 'c' x 100_000_000; # 7bit utf-8 flag on
> $c = decode('ISO-8859-1', $c);
> timethese (-3, {
> 'a eq b' => sub { $a eq $b; },
> 'a eq c' => sub { $a eq $c; },
> } );
> Benchmark: running a eq b, a eq c for at least 3 CPU seconds...
> a eq b: 4s (4.72 usr + -0.02 sys = 4.70 CPU) 7218655.96/s (n=33927683)
> a eq c: 3s (2.80 usr + 0.46 sys = 3.26 CPU) 2.76/s (n=9)

Ah, good to know there are more people who don't mind using 100 MB
strings.

I thought you meant implicit scanning, i.e. not caused by manual
decoding, or automatic upgrading.

decode might optimize latin1 or ascii some day. The documentation
already claims that it does that, but it doesn't.

When optimizing, knowledge of the internals can help a great deal. I
stress that you don't need this knowledge for a working program, and
that working with 100 MB strings and then comparing them in a tight loop
is not common. But anyway, a nice optimization is to do utf8::downgrade
on a string that you just decoded from latin1. Then you pay only a
one-time price. Depending on your data, however, a better optimization
may be to utf8::upgrade the other two.

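The downgrade suggestion could look roughly like this. It is a sketch
only: the variable names and the much smaller string length are
assumptions chosen to keep the example quick, and whether it pays off
depends on how often the strings are compared afterwards.

```perl
use strict;
use warnings;
use Encode qw(decode);

# An octet string and a string with the same characters that went
# through decode().
my $plain   = 'a' x 1_000;
my $decoded = decode('ISO-8859-1', 'a' x 1_000);

# One-time cost: convert the decoded string back to one octet per
# character. This cannot fail here, because ISO-8859-1 never yields
# characters above 255.
utf8::downgrade($decoded);

# Later comparisons now involve two strings with the same internal
# representation, so no per-comparison conversion is needed.
print "strings are equal\n" if $plain eq $decoded;
```

If most of the other strings in the program are already stored as UTF-8
internally, calling utf8::upgrade on the octet strings instead gives the
same effect from the other direction.
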
--
kind regards,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales convolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.