
Thread: stronger type determination (was Re: the utf8 flag ...)




stronger type determination (was Re: the utf8 flag ...)
user name
2007-03-31 01:18:44
Juerd Waalboer said on March 28, 2007 02:13:
> What I want (and I think you want too) is a real type system, to
> have two different distinct types: byte strings and character
> strings. It would be bad to use a flag called "UTF8" for this,
> because a byte string can also be UTF8 encoded. Perl already suffers
> from this problem, but because the UTF8 flag is *INTERNAL*, it's not
> a big deal. It would be if it surfaced and was used by Perl coders.

Yes, a stronger type system is exactly what I want, and that is what
my example library (in real life, QDRDBMS) wants to use internally;
it internally treats character data (which is encoding agnostic) and
binary data (undifferentiated bits) and integers and non-integer
numbers all as disjoint data types that must be explicitly converted
between.  The aforementioned 4 are like Perl 6's Str, Blob, Int, Num,
except that Perl 6 provides implicit conversion in many cases.

I want to emphasize here that I knowingly want to access details that
normal programmers, and users of my library, shouldn't have to know
about, because I am conceptually enhancing the language itself, though
most consequences of that occur behind a wall.

Part of my rationale here is that I want my library to be highly
deterministic, which means there should be zero ambiguity as to what
the input data is, and its semantics should be consistent and easy to
understand.

A Perl 5 string with its utf8 flag off is ambiguous if we want to
treat it as anything other than an undifferentiated string of bytes.
If it is character data, there is a wide multitude of encodings that
it could possibly be; latin-1 is just one of many 8-bit encodings for
example.
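
A minimal sketch of that ambiguity, assuming the core Encode module
and two encoding names chosen purely for illustration: the same single
byte decodes to different characters depending on which 8-bit encoding
the caller declares.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One byte with the utf8 flag off; nothing in the scalar itself says
# which character set it was written in.
my $octets = "\xE9";

my %codepoint;
for my $encoding (qw(iso-8859-1 iso-8859-5)) {
    # decode() turns octets into a character string under the
    # encoding named by the caller.
    my $chars = decode($encoding, $octets);
    $codepoint{$encoding} = ord $chars;
    printf "%-10s => U+%04X\n", $encoding, $codepoint{$encoding};
}
```

The two lines printed differ, which is the point: only the caller
knows which decoding is the right one.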

I prefer for my library to only accept strongly vetted and
unambiguous data, and let the user program deal with the consequences
of Perl 5's weak scalar type system, by explicitly resolving weak
values into strong ones itself.  I'm not just going to *assume* that
strings with the bit off are latin-1.

I will note that the user invoking Encode routines or setting
filehandle traits is an explicit action on their part, so conversion
between bytes and characters *is* being done explicitly, and so users
are thinking about it and the results should not be ambiguous.

> A whole type system is a bit too much to implement in Perl 5, I
> think. Our current unicode string semantics are a great way to deal
> with not having types, in my opinion.

While Perl 5 doesn't officially have a strong type system, unlike
Perl 6, I do recognize that it does still conceive each scalar value
as one of several distinct data types internally, and this is largely
exposed in the language, and I want to exploit it so that I can get
as close to strong semantics as I can under the circumstances.

For example, is_utf8() to my mind says whether Perl says a scalar is
considered to be characters (internal encoding doesn't matter) or
undifferentiated bits, and in normal cases that flag would be set
true by something like a successful invocation of
Encode::decode_utf8(), since that function vetted the data and so
moved the string from ambiguous to something unambiguous.
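
As a small sketch of the behaviour described here, assuming the core
Encode module and a two-byte input picked for illustration, the flag
can be observed before and after decoding. It only shows what
utf8::is_utf8() reports for this one input; it does not establish that
the flag is a dependable type marker in general.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Two octets: the UTF-8 encoding of U+00E9.
my $octets = "\xC3\xA9";

# After a successful decode the result holds one character.
my $chars = decode_utf8($octets);

printf "octets: length %d, flag %s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';
printf "chars : length %d, flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```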

Since Perl 5 lacks strong data types in the general sense, unlike
Perl 6, I am trying the best I can to use whatever clues Perl 5 can
give me, such as that flag, or access to some internal flag to say
whether a scalar is in string or number mode.

Frankly, I would like to easily pass/fail on these examples:

   wants_int( 42 );       # allows
   wants_int( "42" );     # routine throws exception
   wants_int( 0+$foo );   # allows
   wants_int( ''.$foo );  # routine throws exception

   wants_text( 42 );      # routine throws exception
   wants_text( "42" );    # allows
   wants_text( 0+$foo );  # routine throws exception
   wants_text( ''.$foo ); # allows

This is assuming that Perl actually records 42 and "42" differently;
if it doesn't, then I won't ask for the ability to discriminate since
Perl itself doesn't; but if Perl treats those differently, I want to
as well.
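
One possible sketch of such checks, assuming it is acceptable to peek
at a scalar's internal flags through the core B module. The helper
name _kind and the error messages are hypothetical, and the flags
reflect how a value was last used, so a variable that has been used in
both numeric and string context may carry both and is treated as text
here.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: classify a scalar by its current internal flags.
sub _kind {
    my $flags = B::svref_2object(\$_[0])->FLAGS;
    return 'text' if $flags & B::SVp_POK();   # has a string value
    return 'int'  if $flags & B::SVp_IOK();   # has an integer value
    return 'num'  if $flags & B::SVp_NOK();   # has a floating-point value
    return 'other';
}

sub wants_int  { _kind($_[0]) eq 'int'  or die "not an integer\n"; return 1 }
sub wants_text { _kind($_[0]) eq 'text' or die "not text\n";       return 1 }

my $foo = '7';
wants_int(42);             # allowed
wants_int(0 + $foo);       # allowed
wants_text("42");          # allowed
wants_text('' . $foo);     # allowed
print "rejected: $@" unless eval { wants_int("42") };
print "rejected: $@" unless eval { wants_text(42) };
```

The arguments are inspected through the @_ alias rather than copied
into a lexical first, so that the check sees the flags of the caller's
value as passed.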

Juerd also said:
> How often should Perl check for this? Directly after decoding only,
> or also after mutating operations like substr, or s///?

The utf8 flag is turned on or off only as a result of, e.g., decode()
or encode(); a string mutation would not change it.

As a corollary to what I said before, pack() should always return a
string with the flag off, as its result is a bit string, and likewise
the string that unpack() decodes should be expected to have the
flag off, because its actual bit pattern is significant.
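
A brief sketch of what this looks like for a plain binary template,
assuming the 'N' (32-bit big-endian) format and an arbitrary value
chosen for illustration. It shows the behaviour for this template
only; other templates, such as 'U', may behave differently.

```perl
use strict;
use warnings;

# Pack one 32-bit big-endian integer into four raw bytes.
my $packed = pack 'N', 0xCAFEBABE;

printf "length %d, flag %s\n",
    length($packed), utf8::is_utf8($packed) ? 'on' : 'off';

# Round-trip: the bit pattern is what matters, not any character meaning.
my ($value) = unpack 'N', $packed;
printf "round-trip: 0x%X\n", $value;
```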

Also, it should be an error for, e.g., $raw_jpeg_image_data to have
the utf8 flag on, since it is obviously a bit pattern.

-- Darren Duncan

Re: stronger type determination (was Re: the utf8 flag ...)
user name
2007-03-31 07:48:17
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Moin,

On Saturday 31 March 2007 06:18:44 Darren Duncan wrote:
> Juerd Waalboer said on March 28, 2007 02:13:
> > What I want (and I think you want too) is a real type system, to
> > have two different distinct types: byte strings and character
> > strings. It would be bad to use a flag called "UTF8" for this,
> > because a byte string can also be UTF8 encoded. Perl already suffers
> > from this problem, but because the UTF8 flag is *INTERNAL*, it's not
> > a big deal. It would be if it surfaced and was used by Perl coders.
>
> Yes, a stronger type system is exactly what I want, and that is what
> my example library (in real life, QDRDBMS) wants to use internally;
> it internally treats character data (which is encoding agnostic) and
> binary data (undifferentiated bits) and integers and non-integer
> numbers all as disjoint data types that must be explicitly converted
> between.  The aforementioned 4 are like Perl 6's Str, Blob, Int, Num,
> but that Perl 6 provides implicit conversion in many cases.
>
> I want to emphasize here that I am knowingly wanting to access
> details that normal programmers, and users of my library, shouldn't
> have to know about, because I am conceptually enhancing the language
> itself, though most consequences of that occur behind a wall.
>
> Part of my rationale here is that I want my library to be highly
> deterministic, which means there should be zero ambiguity as to what
> the input data is, and its semantics should be consistent and easy to
> understand.
>
> A Perl 5 string with its utf8 flag off is ambiguous if we want to
> treat it as anything other than an undifferentiated string of bytes.
> If it is character data, there is a wide multitude of encodings that
> it could possibly be; latin-1 is just one of many 8-bit encodings for
> example.
>
> I prefer for my library to only accept strongly vetted and
> unambiguous data, and let the user program deal with the consequences
> of Perl 5's weak scalar type system, where they explicitly resolve
> themselves weak values into strong ones.  I'm not just going to
> *assume* that strings with the bit off are latin-1.

Thank you for summing this up so nicely. I strongly (no pun intended)
agree with what you wrote. For "normal" Perl scripts I can live with
the "assuming" stage, but when I write a library, I do not want my
data to morph under me - my code shouldn't need to guess.

All the best,

Tels

- -- 
 Signed on Sat Mar 31 12:44:57 2007 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "My glasses, my glasses. I cannot see without my glasses."
 - "My glasses, my glasses. I cannot be seen without my glasses."

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRg5YkXcLPEOTuEwVAQKaXgf/QjCkKz0kTVUII8+ZdeH10vpdG2DeGjjo
ZmYtjCBHCqOOAp+GmjlHzLOCqV6bnNAiNJ8TlgiOul3ECH27p9d6djpO7jhQ6A/t
+zpepgFp6coa5Rlv6cSr3STDj6TDdV7HubhDvWn63VVCyrsweBSg/PTQfBHbdRew
mKcN1/Iv37wdsvle2Mxg/lQW0WyTorCVW/bTYrNfM6yYI3Xzvv3mONbYDYh/jHqZ
lMsUxGgwlKjAMi2Cs2y+zNfbZn7zcqbBhol68v/k9ytgf+gMzEsXNnYOw3dfFuuE
K9575Y7E/8g5coley7e7hYLBTg+3P2DDvKmZsmetCUaJUrpnzmRBPg==
=f5+D
-----END PGP SIGNATURE-----

Re: stronger type determination (was Re: the utf8 flag ...)
user name
2007-03-31 11:15:06
Darren Duncan wrote 2007-03-30 23:18 (-0700):
> For example, is_utf8() to my mind says whether Perl says a scalar is
> considered to be characters

Then your mind needs repairing.

But you're not alone in your thinking. The same brain waves broke the
regex engine.

Every string consists of characters.

my $eacute1 = my $eacute2 = chr 233;
utf8::upgrade($eacute2);

# Different encoding internally, different state of the UTF8 flag, but:

is($eacute1, $eacute2);  # is() as provided by Test::More

Both have the same single character, even though the internal
representation was forced to change.
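
The same point as a standalone script, a sketch that swaps the is()
test for plain comparisons and uses only the built-in utf8:: functions:

```perl
use strict;
use warnings;

my $eacute1 = chr 233;       # one character, stored as a single byte
my $eacute2 = chr 233;
utf8::upgrade($eacute2);     # same character, now stored as UTF-8

# The internal flag differs between the two scalars...
printf "flag: %s vs %s\n",
    map { utf8::is_utf8($_) ? 'on' : 'off' } $eacute1, $eacute2;

# ...but length and string equality operate on characters.
printf "length: %d vs %d\n", length($eacute1), length($eacute2);
print  "equal as strings\n" if $eacute1 eq $eacute2;
```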
-- 
Kind regards,

  juerd waalboer:  perl hacker  <juerdjuerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <salesconvolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
