Thread: Re: What to do for bytes in 2.6?
|
|
| Re: What to do for bytes in 2.6? |

|
2008-01-19 13:32:46 |
On Jan 19, 2008 10:53 AM, Neil Schemenauer <nas arctrix.com> wrote:
> Guido van Rossum <guido python.org> wrote:
> > bytes is an alias for str (not even a subclass)
> > b"" is an alias for ""
>
> One advantage of a subclass is that there could be a flag that warns
> about combining bytes and unicode data. For example, b"x" + u"y"
> would produce a warning. As someone who writes internationalized
> software, I would happily use both the byte designating syntax and
> the warning flag, even if I wasn't planning to move to Python 3.
Yeah, that's what everybody thinks at first -- but the problem is that
most of the time (in 2.6 anyway) the "bytes" object is something read
from a socket, for example, not something created from a b"" literal.
There is no way to know whether that return value means text or data
(plenty of apps legitimately read text straight off a socket in 2.x),
so the socket object can't return a bytes instance, and hence the
warning won't trigger.

Really, the pure aliasing solution is just about optimal in terms of
bang per buck.
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-Dev mailing list
Python-Dev python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/nessto%40sharedlog.com
|
|
| Re: What to do for bytes in 2.6? |
2008-01-19 19:54:48 |
On 19 Jan, 07:32 pm, guido python.org wrote:
>There is no way to know whether that return value means text or data
>(plenty of apps legitimately read text straight off a socket in 2.x),

IMHO, this is a stretch of the word "legitimately". If you're
reading from a socket, what you're getting are bytes, whether they're
represented by str() or bytes(); correct code in 2.x must currently do a
.decode("ascii") or .decode("charmap") to "legitimately" identify the
result as text of some kind.
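
The explicit decoding step described above can be sketched as follows. This is only an illustration under stated assumptions: it is written for Python 3 (where recv() already returns bytes), it uses a local socket pair instead of a real network peer, and the payload and the choice of UTF-8 are hypothetical stand-ins for whatever the two ends have agreed on.

```python
import socket

# Assumption: both ends have agreed on this encoding out of band.
ENCODING = "utf-8"
message = "caf\u00e9"  # hypothetical payload containing one non-ASCII character

left, right = socket.socketpair()
try:
    left.sendall(message.encode(ENCODING))
    left.shutdown(socket.SHUT_WR)  # tell the reader that no more data is coming

    # Read until end of stream; what arrives is bytes, not text.
    chunks = []
    while True:
        chunk = right.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    raw = b"".join(chunks)
finally:
    left.close()
    right.close()

# Only the explicit decode turns the received bytes into text.
text = raw.decode(ENCODING)
print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```

The byte count and the character count differ here because the non-ASCII character occupies two bytes in UTF-8, which is the kind of mismatch an implicit str-as-text treatment hides.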
Now, ad-hoc code with a fast and loose definition of "text" can still
read arrays of bytes off a socket without specifying an encoding and get
away with it, but that's because Python's unicode implementation has
thus far been very forgiving, not because the data is cleanly text yet.

Why can't we get that warning in -3 mode just the same from something
read from a socket and a b"" literal? I've written lots of code that
aggressively rejects str() instances as text, as well as unicode
instances as bytes, and that's in code that still supports 2.3 ;).

>Really, the pure aliasing solution is just about optimal in terms of
>bang per buck.

Not that I'm particularly opposed to the aliasing solution, either. It
would still allow writing code that was perfectly useful in 2.6 as well
as 3.0, and it would avoid disturbing code that did checks of type("").
It would just remove an opportunity to get one potentially helpful
warning.
|
|
| Re: What to do for bytes in 2.6? |

|
2008-01-19 22:26:43 |
On Jan 19, 2008 5:54 PM, <glyph divmod.com> wrote:
> On 19 Jan, 07:32 pm, guido python.org wrote:
> >There is no way to know whether that return value means text or data
> >(plenty of apps legitimately read text straight off a socket in 2.x),
>
> IMHO, this is a stretch of the word "legitimately". If you're
> reading from a socket, what you're getting are bytes, whether they're
> represented by str() or bytes(); correct code in 2.x must currently do a
> .decode("ascii") or .decode("charmap") to "legitimately" identify the
> result as text of some kind.
>
> Now, ad-hoc code with a fast and loose definition of "text" can still
> read arrays of bytes off a socket without specifying an encoding and get
> away with it, but that's because Python's unicode implementation has
> thus far been very forgiving, not because the data is cleanly text yet.

I would say that depends on the application, and on arrangements that
client and server may have made off-line about the encoding.

In 2.x, text can legitimately be represented as str -- there's even
the locale module to further specify how it is to be interpreted as
characters.

Sure, this doesn't work for full unicode, and it doesn't work for all
protocols used with sockets, but claiming that only fast and loose
code ever uses str to represent text is quite far from reality -- this
would be saying that the locale module is only for quick and dirty
code, which just ain't so.

> Why can't we get that warning in -3 mode just the same from something
> read from a socket and a b"" literal?

If you really want this, please think through all the consequences,
and report back here. While I have a hunch that it'll end up giving
too many false positives and at the same time too many false
negatives, perhaps I haven't thought it through enough. But if you
really think this'll be important for you, I hope you'll be willing to
do at least some of the thinking.

I believe that a constraint should be that by default (without -3 or a
__future__ import) str and bytes should be the same thing. Or, another
way of looking at this, reads from binary files and reads from sockets
(and other similar things, like ctypes and mmap and the struct module,
for example) should return str instances, not instances of a str
subclass by default -- IMO returning a subclass is bound to break too
much code. (Remember that there is still *lots* of code out there that
uses "type(x) is types.StringType" rather than "isinstance(x, str)",
and while I'd be happy to warn about that in -3 mode if we could, I
think it's unacceptable to break that in the default environment --
let it break in 3.0 instead.)
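
The breakage from exact type checks can be illustrated with a small sketch. It is written for Python 3, so plain str stands in for the 2.x str type and the comparison with types.StringType becomes "type(value) is str"; the TaggedStr class and both helper functions are hypothetical names introduced only for this illustration.

```python
class TaggedStr(str):
    """Hypothetical str subclass standing in for a separate bytes type."""


def exact_type_check(value):
    # The style of check attributed to older code: matches str itself only.
    return type(value) is str


def isinstance_check(value):
    # Accepts str and any subclass of str.
    return isinstance(value, str)


plain = "data"
tagged = TaggedStr("data")

print(exact_type_check(plain), exact_type_check(tagged))
print(isinstance_check(plain), isinstance_check(tagged))
```

The two values compare equal, yet the exact type check rejects the subclass instance, which is why returning a subclass from existing APIs would change the behavior of code that never asked for it.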

> I've written lots of code that
> aggressively rejects str() instances as text, as well as unicode
> instances as bytes, and that's in code that still supports 2.3 ;).

Yeah, well, but remember, while keeping you happy is high on my list
of priorities, it's not the only priority.

> >Really, the pure aliasing solution is just about optimal in terms of
> >bang per buck.
>
> Not that I'm particularly opposed to the aliasing solution, either. It
> would still allow writing code that was perfectly useful in 2.6 as well
> as 3.0, and it would avoid disturbing code that did checks of type("").

Right.

> It would just remove an opportunity to get one potentially helpful
> warning.

I worry that the warning wouldn't come often enough, and that too
often it would be unhelpful. There will inevitably be some stuff where
you just have to try to convert the code using 2to3 and try to run it
under 3.0 in order to see if it works. And there's also the concern of
those who want to use 2.6 because it offers 2.5 compatibility plus a
fair number of new features, but who aren't interested (yet) in moving
up to 3.0. I expect that Google will initially be in this category
too.
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
|
|
| Re: What to do for bytes in 2.6? |
2008-01-20 01:49:56 |
On 04:26 am, guido python.org wrote:
>On Jan 19, 2008 5:54 PM, <glyph divmod.com> wrote:
>>On 19 Jan, 07:32 pm, guido python.org wrote:

Starting with the most relevant bit before getting off into digressions
that may not interest most people:

>>Why can't we get that warning in -3 mode just the same from something
>>read from a socket and a b"" literal?
>
>If you really want this, please think through all the consequences,
>and report back here. While I have a hunch that it'll end up giving
>too many false positives and at the same time too many false
>negatives, perhaps I haven't thought it through enough. But if you
>really think this'll be important for you, I hope you'll be willing to
>do at least some of the thinking.

While I stand by my statement that unicode is the Right Way to do text
in python, this particular feature isn't really that important, and I
can see there are cases where it might cause problems or make life more
difficult. I suspect that I won't really know whether I want the
warning anyway before I've actually tried to port any nuanced, real
text-processing code to 3.0, and it looks like it's going to be a little
while before that happens. I suspect that if I do want the warning, it
would be a feature for 2.7, not 2.6, so I don't want to waste a lot of
everyone's time advocating for it.

Now for a nearly irrelevant digression (please feel free to stop reading
here):

>>Now, ad-hoc code with a fast and loose definition of "text" can still
>>read arrays of bytes off a socket without specifying an encoding and
>>get away with it, but that's because Python's unicode implementation has
>>thus far been very forgiving, not because the data is cleanly text
>>yet.
>
>I would say that depends on the application, and on arrangements that
>client and server may have made off-line about the encoding.

I can see your point. I think it probably holds better on files and
streams than on sockets, though - please forgive me if I don't think
that server applications which require environment-dependent out-of-band
arrangements about locale are correct.

>In 2.x, text can legitimately be represented as str -- there's even
>the locale module to further specify how it is to be interpreted as
>characters.

I'm aware that this specific example is kind of a ridiculous stretch,
but it's the first one that came to mind. Consider
len(u'é'.encode('utf-8').rjust(5).decode('utf-8')). Of course
unicode.rjust() won't do the right thing in the case of surrogate pairs,
not to mention RTL text, but it still handles a lot more cases than
str.rjust(), since code points behave a lot more like characters than
code units do.
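
The rjust() example above can be run as a short sketch. It is written for Python 3, where bytes plays the role of the 2.x str type and str plays the role of 2.x unicode; the variable names are introduced only for this illustration.

```python
text = "\u00e9"  # 'é': one code point, two bytes when encoded as UTF-8

# Padding the encoded form counts bytes, so only three spaces are added.
padded_bytes = text.encode("utf-8").rjust(5)
via_bytes = padded_bytes.decode("utf-8")

# Padding the text form counts code points, so four spaces are added.
via_text = text.rjust(5)

print(len(padded_bytes), len(via_bytes), len(via_text))
```

Padding at the byte level produces a field that is five bytes wide but only four characters wide once decoded, while padding the text directly gives the five characters that were asked for.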

>Sure, this doesn't work for full unicode, and it doesn't work for all
>protocols used with sockets, but claiming that only fast and loose
>code ever uses str to represent text is quite far from reality -- this
>would be saying that the locale module is only for quick and dirty
>code, which just ain't so.

It would definitely be overreaching to say all code that uses str is
quick and dirty. But I do think that it fits into one of two
categories: quick and dirty, or legacy. locale is an example of a
legacy case for which there is no replacement (that I'm aware of). Even
if I were writing a totally unicode-clean application, as far as I'm
aware, there's no common replacement for e.g. locale.currency().

Still, locale is limiting. It's ... uncomfortable to call
locale.currency() in a multi-user server process. It would be nice if
there were a replacement that completely separated encoding issues from
localization issues.

>I believe that a constraint should be that by default (without -3 or a
>__future__ import) str and bytes should be the same thing. Or, another
>way of looking at this, reads from binary files and reads from sockets
>(and other similar things, like ctypes and mmap and the struct module,
>for example) should return str instances, not instances of a str
>subclass by default -- IMO returning a subclass is bound to break too
>much code. (Remember that there is still *lots* of code out there that
>uses "type(x) is types.StringType" rather than "isinstance(x, str)",
>and while I'd be happy to warn about that in -3 mode if we could, I
>think it's unacceptable to break that in the default environment --
>let it break in 3.0 instead.)

I agree. But, it's precisely because this is so subtle that it would be
nice to have tools which would report warnings to help fix it.

*Certainly* by default, everywhere that's "str" in 2.5 should be "str"
in 2.6. Probably even in -3 mode, if the goal there is "warnings only".
However, the feature still strikes me as potentially useful while
porting. If I were going to advocate for it, though, it would be as a
separate option, e.g. "--separate-bytes-type". I say this as separate
from just trying to run the code on 3.0 to see what happens because it
seems like the most subtle and difficult aspect of the port to get
right; it would be nice to be able to tweak it individually, without the
other issues related to 3.0. For example, some of the code I work on
has a big stack of dependencies. Some of those are in C, most of them
don't process text at all. However, most of them aren't going to port
to 3.0 very early, but it would be good to start running in as 3.0-like
of an environment as possible earlier than that so that the hard stuff
is done by the time the full stack has been migrated.

>>I've written lots of code that
>>aggressively rejects str() instances as text, as well as unicode
>>instances as bytes, and that's in code that still supports 2.3 ;).
>
>Yeah, well, but remember, while keeping you happy is high on my list

Thanks, good to hear

>of priorities, it's not the only priority.

I don't think it's even my fiancée's *only* priority, and I think it
should stay higher on her list than yours.
|
|
| Re: What to do for bytes in 2.6? |
2008-01-20 08:56:59 |
On Sat, Jan 19, 2008, Guido van Rossum wrote:
>
> I believe that a constraint should be that by default (without -3 or a
> __future__ import) str and bytes should be the same thing. Or, another
> way of looking at this, reads from binary files and reads from sockets
> (and other similar things, like ctypes and mmap and the struct module,
> for example) should return str instances, not instances of a str
> subclass by default -- IMO returning a subclass is bound to break too
> much code.

This makes perfect sense to me. And yet, I also like the idea that
b""+u"" raises an exception. I have a suggestion, then: for 2.6, let's
make bytes a subclass of string whose only difference is that it contains
a flag. I don't care whether b""+u"" raises an exception. This should
be enough for people on an accelerated 3.0 conversion schedule, and they
can write their own test for the flag if they care.

For 2.7, we can start tightening up. b""+u"" can raise an exception,
and the socket module might use a RETURN_BYTES variable.

To put it another way, I don't think we should consider 2.6 the stopping
point for 2.x/3.x conversion help. There's nothing wrong with planning
for features to go into 2.7 -- just as several PEPs in the past have had
multi-version planning.
--
Aahz (aahz pythoncraft.com) <*> http://www.pythoncraft.com/

"All problems in computer science can be solved by another level of
indirection." --Butler Lampson
|
|
| Re: What to do for bytes in 2.6? |

|
2008-01-20 15:04:44 |
On Jan 20, 2008 6:56 AM, Aahz <aahz pythoncraft.com> wrote:
> On Sat, Jan 19, 2008, Guido van Rossum wrote:
> >
> > I believe that a constraint should be that by default (without -3 or a
> > __future__ import) str and bytes should be the same thing. Or, another
> > way of looking at this, reads from binary files and reads from sockets
> > (and other similar things, like ctypes and mmap and the struct module,
> > for example) should return str instances, not instances of a str
> > subclass by default -- IMO returning a subclass is bound to break too
> > much code.
>
> This makes perfect sense to me. And yet, I also like the idea that
> b""+u"" raises an exception. I have a suggestion, then: for 2.6, let's
> make bytes a subclass of string whose only difference is that it contains
> a flag.

This still begs the question which standard APIs should return bytes.

> I don't care whether b""+u"" raises an exception. This should
> be enough for people on an accelerated 3.0 conversion schedule, and they
> can write their own test for the flag if they care.

Well, it being a subclass, it doesn't even need to have a flag, right?
The class itself acts as a flag.

But still, without support from built-in and standard library APIs I
fear it's not going to be very useful.
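
A rough sketch of "the class itself acts as a flag", and of why it helps little without API support. It is written for Python 3, with bytes standing in for the 2.x str type; MarkedBytes and warn_if_marked are hypothetical names, not anything proposed in this thread or present in any Python release.

```python
import warnings


class MarkedBytes(bytes):
    """Hypothetical marker subclass: adds nothing, the type itself is the flag."""


def warn_if_marked(value):
    """Warn when value carries the marker; return whether it did."""
    if isinstance(value, MarkedBytes):
        warnings.warn("marked bytes used where text was expected", stacklevel=2)
        return True
    return False


from_literal = MarkedBytes(b"abc")  # stands in for a b"" literal
from_api = bytes(b"abc")            # stands in for data from an unmodified API

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = [warn_if_marked(from_literal), warn_if_marked(from_api)]

print(flagged, len(caught))
```

Both values hold the same data, but only the one created through the marker type triggers the warning; anything returned by an API that still produces the plain type passes through unnoticed.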

And fixing all the standard APIs to make the correct distinction is
going to create exactly the ripple effect that Raymond so desperately
wants to avoid -- and I agree, to the extent that rippling this
through the stdlib is a waste of time from the stdlib's POV -- it's
already been 3.0-ified.

> For 2.7, we can start tightening up. b""+u"" can raise an exception,
> and the socket module might use a RETURN_BYTES variable.
>
> To put it another way, I don't think we should consider 2.6 the stopping
> point for 2.x/3.x conversion help. There's nothing wrong with planning
> for features to go into 2.7 -- just as several PEPs in the past have had
> multi-version planning.

Personally, I very much doubt there will *be* a 2.7. I certainly don't
expect to participate in its development the same way I am trying to
steer 2.6.
--
--Guido van Rossum (home page: http://www.python.org/~guido/)