Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)




Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 18:08:21
Marc Lehmann wrote 2007-03-31  0:25 (+0200):
> If you send a compressed string over the network using JSON and decompress
> it, you need to know that.

Does JSON compress arbitrary data? If so, then the user must do the
decoding and encoding, because arbitrary data only exists in byte form.
Once you dictate any specific encoding, it's no longer arbitrary.

On the other hand, if JSON does text data only, it can just use any UTF
encoding on both sides, and document it like that.

Unless both sides are exactly the same platform (e.g. both Perl), you
need to establish a protocol for sending data anyway. And that protocol
should also describe encoding. If sender and receiver don't agree, you
have a problem.
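
As a small illustration of why the protocol has to name the encoding, here is a minimal sketch using the core Encode module. The sample text and the variable names are made up for illustration, and nothing in it is specific to JSON.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample text: "naive" with U+00EF, plus U+263A (above 0xFF).
my $text = "na\x{ef}ve \x{263A}";

# Sender: turn characters into octets using the agreed encoding.
my $wire = encode('UTF-8', $text);

# A receiver that follows the agreement gets the original text back.
my $agreed = decode('UTF-8', $wire);

# A receiver that assumes a different encoding gets different text.
my $mismatch = decode('ISO-8859-1', $wire);

printf "agreed:   %s\n", ($agreed   eq $text ? "same text" : "different text");
printf "mismatch: %s\n", ($mismatch eq $text ? "same text" : "different text");
```

The octets on the wire are identical in both cases; only the receiver's assumption differs.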

> I am really frustrated at that. It makes perl as a whole rather
> questionable for unicode use, as you constantly have to think about
> the internals.  And yes, that simply shouldn't be the case.

I maintain that it isn't the case, for almost any programming job,
unless you're indeed doing things with internals.
-- 
Kind regards,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 20:00:24
Ok, last mail, because this is a different topic 

On Sat, Mar 31, 2007 at 01:08:21AM +0200, Juerd Waalboer <juerdconvolution.nl> wrote:
> Marc Lehmann wrote 2007-03-31  0:25 (+0200):
> > If you send a compressed string over the network using JSON and decompress
> > it, you need to know that.
> 
> Does JSON compress arbitrary data?

No.

> If so, then the user must do the decoding and encoding,

No, compression is something completely orthogonal to encoding. Neither
forces me to do the other.
   
> because arbitrary data only exists in byte form

That seems completely wrong to me.

> Once you dictate any specific encoding, it's no longer arbitrary.

JSON dictates Unicode for the JSON text, and strongly hints at the use of
UTF-8 for interchange purposes.

> On the other hand, if JSON does text data only,

No, it does support binary data just as well. It is used a lot, too.

It works just like perl without the bugs: You have a string type that can
store bytes. It is up to the user to interpret them as she wants.

> it can just use any UTF encoding on both sides, and document it like
> that.

It is a bit complicated, but you can safely assume that 99% of all JSON
is UTF-8 encoded. In fact, you can recode all JSON documents into ASCII,
too. JSON::XS offers that, and JSON::XS by default encodes to/decodes
from UTF-8, but allows the user to decode/encode himself. JSON text is
composed of Unicode characters, and in Perl some JSON modules store them
as a simple Perl string.

All that is not well-supported by most JSON modules, though. For example,
JSON::XS is the only module for perl that correctly decodes escaped
surrogate pairs.
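
To make the ASCII recoding and the surrogate-pair point concrete, here is a minimal sketch. It uses JSON::PP, the pure-Perl module that ships with newer perls, as a stand-in for JSON::XS; that substitution is an assumption made so the example runs without a compiled module, and it relies on the two modules offering the same `ascii` and `utf8` options. The sample text is made up.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical sample: "cafe" with U+00E9, plus U+1F600 (outside the BMP).
my $text = "caf\x{e9} \x{1F600}";

# ascii: every non-ASCII character is written as a \uXXXX escape, and
# characters outside the BMP become an escaped surrogate pair.
my $ascii_json = JSON::PP->new->ascii->encode([$text]);

# utf8: the encoder emits UTF-8 octets, ready to be written to a socket.
my $utf8_json = JSON::PP->new->utf8->encode([$text]);

# Both forms decode back to the same Perl character string.
my $from_ascii = JSON::PP->new->decode($ascii_json)->[0];
my $from_utf8  = JSON::PP->new->utf8->decode($utf8_json)->[0];

print "ASCII form: $ascii_json\n";
print "round trip via ascii: ", ($from_ascii eq $text ? "ok" : "FAILED"), "\n";
print "round trip via utf8:  ", ($from_utf8  eq $text ? "ok" : "FAILED"), "\n";
```

Decoding the ASCII form only works if the decoder combines the two escaped surrogates back into one character, which is the case described above.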

> Unless both sides are exactly the same platform (e.g. both Perl), you
> need to establish a protocol for sending data anyway. And that protocol
> should also describe encoding. If sender and receiver don't agree, you
> have a problem.

No, it doesn't have anything to do with the platform. Even when both sides
use Perl I need to decide on a common encoding. That's strictly outside the
JSON definition, though.

> > I am really frustrated at that. It makes perl as a whole rather
> > questionable for unicode use, as you constantly have to think about
> > the internals.  And yes, that simply shouldn't be the case.
>
> I maintain that it isn't the case, for almost any programming job,
> unless you're indeed doing things with internals.

Well, the JSON::XS module certainly does things with the internals: it
has to flag some strings as UTF-X, and in fact flags all strings that
way unless you enable the shrink option, which is documented to try to
shrink the memory used in various ways (one way is to try to downgrade the
scalar).

Certainly, the user who reported the bug also didn't look at the
internals.  Compress::Zlib called unpack "CCCV" or somesuch, though, which
unfortunately treats V very differently from C: it looks at the internals
with "C", but treats the string as an octet string with "V".

The user suggested that JSON::XS corrupts binary data because it happens to
be returned upgraded unless you set the shrink option.

However, Perl does not expose the internals elsewhere: the upgraded
version is semantically equivalent to the downgraded one unless you use
an XS module using SvPV directly or indirectly (considered a bug in Perl,
if I understood Nick correctly), or unpack "C", as that has a different
meaning in perl 5.6 than in perl 5.005, and has confusing documentation.

The right thing for Compress::Zlib is not to use unpack "CCCV" but unpack
"UUUV", which seems completely weird to me, as no Unicode was ever
involved *on the perl level*.
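
A rough way to check how a given perl treats the two forms is sketched below. The header bytes and variable names are made up; this is not the actual Compress::Zlib code. One assumption to keep in mind: from perl 5.10 onward, pack and unpack process strings character by character by default, so both calls are expected to agree there, while the difference described above applies to the older perls this thread is about.

```perl
use strict;
use warnings;

# Hypothetical 7-octet header: three single bytes and a little-endian
# 32-bit value. The third byte is above 0x7F on purpose.
my $header = pack("CCCV", 0x1F, 0x8B, 0xC8, 0xDEADBEEF);

# Same string on the Perl level, but stored upgraded internally.
my $upgraded = $header;
utf8::upgrade($upgraded);

my @plain = unpack("CCCV", $header);
my @up    = unpack("CCCV", $upgraded);

print "downgraded: @plain\n";
print "upgraded:   @up\n";
print "unpack agrees on both forms: ",
    ("@plain" eq "@up" ? "yes" : "no"), "\n";
```

If the last line prints "no", the perl in use lets the internal representation leak through unpack "C".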

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcggoof.com
      --==---/ / _ / // / / /      http://schmorp.de/
      -=====/_/_//_/_,_/ /_/_      XX11-RIPE

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
user name
2007-03-30 20:05:06
Oh, maybe I know the reason for the confusion.

I do talk about the *Perl* level, while you often talk about the
*implementation*. When I say byte or octet string below, I mean on the
Perl level. For example, on the Perl level, upgrading a string does not
change its semantics anywhere except w.r.t. bugs and unpack: It still
stays an octet string if it was an octet string before.

(That's of course all in line with me not wanting to expose the UTF-X
flag).
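
The Perl-level claim can be sketched with the built-in utf8:: functions. This is only an illustration with made-up sample octets; utf8::is_utf8 is used here purely to show that the internal flag changed while everything visible on the Perl level stayed the same.

```perl
use strict;
use warnings;

# Four octets, two of them above 0x7F, built as a plain byte string.
my $octets = pack("C*", 0x00, 0x7F, 0x80, 0xFF);

my $upgraded = $octets;
utf8::upgrade($upgraded);    # changes only the internal representation

printf "flag before/after upgrade: %d/%d\n",
    (utf8::is_utf8($octets) ? 1 : 0), (utf8::is_utf8($upgraded) ? 1 : 0);
printf "equal: %s, lengths: %d/%d\n",
    ($octets eq $upgraded ? "yes" : "no"),
    length($octets), length($upgraded);

# Character by character, the values are still plain octets 0..255.
my @before = map { ord } split //, $octets;
my @after  = map { ord } split //, $upgraded;
print "ord values: @before / @after\n";

# Downgrading brings back the original internal form.
my $restored = $upgraded;
utf8::downgrade($restored);
```

Only code that inspects the internal buffer, such as an XS module reading the raw bytes, would notice a difference between the two copies.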

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcggoof.com
      --==---/ / _ / // / / /      http://schmorp.de/
      -=====/_/_//_/_,_/ /_/_      XX11-RIPE

