Thread: Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 19:04:53

Marc Lehmann wrote 2007-03-31 1:33 (+0200):
> The difference between us, and that's what it boils down to, is that you give
> the internal UTF-X bit meaning. You equate UTF-X flag set == Unicode string.

No, that's a unidirectional thing.

I've said it on p5p at least a dozen times, but I'll say it again:

If the UTF8 flag is set, you can be sure that you have a text string.
If the UTF8 flag is not set, it can be either a byte string or a text
string.

If you have a text string, the UTF8 flag may or may not be set. If you have
a byte string, the UTF8 flag is not set (or it was set because you
treated the byte string as a text string).
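
[Editorial illustration] The rule that a text string may or may not carry the flag can be sketched as follows. This is a minimal sketch, not code from the thread; the variable names are made up, and utf8::is_utf8 is called only to make the internals visible, which is exactly what the message advises against doing in real programs.

```perl
use strict;
use warnings;

# One text string, "cafe" with an acute accent, written with a
# single-byte escape for the last character.
my $native = "caf\xe9";

# A copy whose internal storage is switched to UTF-8. utf8::upgrade
# changes the representation only; the characters stay the same.
my $upgraded = $native;
utf8::upgrade($upgraded);

# Peek at the internal flag, purely for illustration.
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# Compare the two copies at the Perl level.
my $same_text   = ($native eq $upgraded)                 ? 1 : 0;
my $same_length = (length($native) == length($upgraded)) ? 1 : 0;

print "flag (native):   $flag_native\n";
print "flag (upgraded): $flag_upgraded\n";
print "equal as text:   $same_text\n";
print "equal length:    $same_length\n";
```

Both copies are the same text string at the Perl level; only the internal flag differs.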

> The problem with your approach is that you have to expose the UTF-X flag
> to users. Which comes with a lot of problems.

Again: you're kidding, right?

I'm constantly very explicitly and verbosely telling people to NOT look
at the flag, NOT set it manually, etcetera.

Heck, I've even explained that I think you should try to (pretend to) be
ignorant about the internals, in response to your message even!

I do not understand how you are able to misinterpret this message even
after this many posts in this thread alone. Have you ever read
perlunitut, even?

> Initially I thought you, too, wanted a unicode model where the UTF-X bit is
> not exposed to the perl level. But in fact the opposite is true: you
> force knowledge of the UTF-X bit on users, even though it should be
> transparent.
> ...
> the problem is you want them to track the UTF-X flag in addition to that.
> ...
> Then why do you want to force people to know about how
> 128..255 is encoded internally then?

That's not what I said, nor what I meant. In fact, quite the opposite.

If you're just spending this evening just to get on my nerves, then
congratulations!

> > Oh, but they do. Please read perlunitut, which tries to redefine the
> > universe into four important definitions (and succeeds).
> I do not have that manpage.

http://www.google.com/search?q=perlunitut&btnI=I'm+Feeling+Lucky

> Because "internal format" strings can store binary data just as well,
> and often do.

Yes, and when you use such a byte string as a text string, its bytes are
considered to be codepoints, just like in latin1.
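
[Editorial illustration] A small sketch of what "bytes are considered to be codepoints" means in practice. It is not code from the thread and the variable names are hypothetical: a one-byte string is joined with a character above 255, and the former byte then behaves as code point U+00E9.

```perl
use strict;
use warnings;

# A single byte with value 0xE9, as it might come from a binary file.
my $byte_string = "\xe9";

# Use it as text: concatenate with a character beyond 255 (U+263A).
my $text = $byte_string . "\x{263A}";

# The former byte is now the character U+00E9, as under latin1.
my $first_codepoint = ord(substr($text, 0, 1));

# Encoding a copy as UTF-8 shows which code points the string holds.
my $encoded = $text;
utf8::encode($encoded);
my $hex = join ' ', map { sprintf '%02X', ord } split //, $encoded;

printf "first code point: U+%04X\n", $first_codepoint;
print  "UTF-8 bytes: $hex\n";
```

The byte 0xE9 ends up encoded as C3 A9, the UTF-8 form of U+00E9, not as a raw E9 byte.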

> I am talking purely about the perl level strings. If perlunitut confused
> the issue by talking about internal encoding it completely failed its
> mission, imho.

I strongly suggest that you READ the document before whining about its
supposed failure.

> The problem is that some parts of perl make a difference between the
> very same string, depending on how it is encoded internally, _even if
> the encoding is the same on the Perl level_.

Those are bugs. Report them, and they might get fixed.

> > utf8::encode is a text operation. It will assume that whatever you give
> > it, is a text string. Its characters are considered Unicode codepoints.
> Where does it say so?

Well, you have already denied that "encoding is going from characters to
bytes" is a real world fact, so I guess there's little point in pointing
out the places where exactly the same thing is explained.
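
[Editorial illustration] The claim that utf8::encode treats its argument as text can be checked with a short sketch. It is not code from the thread; it feeds already-encoded bytes to utf8::encode and shows that each byte is read as a character and encoded again.

```perl
use strict;
use warnings;

# Two bytes that happen to be the UTF-8 encoding of U+00E9.
my $already_encoded = "\xc3\xa9";

# utf8::encode works in place and treats each element as a character,
# so the two bytes are read as U+00C3 and U+00A9 and encoded again.
my $double = $already_encoded;
utf8::encode($double);

my $hex = join ' ', map { sprintf '%02X', ord } split //, $double;
print "before: ", length($already_encoded), " bytes\n";
print "after:  ", length($double), " bytes ($hex)\n";
```

The result is the familiar double-encoding pattern, which is what you get when a byte string is handed to a text operation.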

> > you need to know some internals.
> Wrong. I need know no internals

A certain Marc Lehmann once said:

"I would love if that were the case, but the powers to be decided that
every perl programmer has to know those internals, and needs to be able
to deal with them."

> > That makes no sense, because UTF-8 is a means of representing
> > characters. Byte strings consist of bytes, not characters.
> Not in C, which is what the documentation constantly refers to, mind
> you.

And that is bad, I agree. Perl programmers should not be expected to
speak C in order to understand Perl documentation. This is a big
problem in Perl's documentation, but who's going to fix it?

--
Kind regards,
juerd waalboer: perl hacker <juerd juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales convolution.nl>
I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 21:15:43
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Moin,
On Saturday 31 March 2007 00:04:53 Juerd Waalboer wrote:
> Marc Lehmann wrote 2007-03-31 1:33 (+0200):
> > The difference between us, and that's what it boils down to, is that you
> > give the internal UTF-X bit meaning. You equate UTF-X flag set ==
> > Unicode string.
>
> No, that's a unidirectional thing.
>
> I've said it on p5p at least a dozen times, but I'll say it again:
>
> If the UTF8 flag is set, you can be sure that you have a text string.
> If the UTF8 flag is not set, it can be either a byte string or a text
> string.
>
> If you have a text string, the UTF8 flag may or may not be set.

So you are basically saying that you can have any string (text or byte) with
either the flag set, or not. Er, and how do we find out which combination
is which?

I think we all should go to bed and have a nice rest. What you wrote above
makes no sense at all to me now anymore.

So, good night for now,

Te"pls don't tell mom it's 2 already"ls
- --
Signed on Sat Mar 31 02:12:15 2007 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or per email.
Like my code? Want to hire me to write some code for you? Send email!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
iQEVAwUBRg3ET3cLPEOTuEwVAQJ7QQf/QmX+IUIaVxgJMSfCrGFnQDRlXzKEHXBk
fIsz1cCNmwPeRJsskLxxkRsC2TlufgccRx3RSN0HcI56l79ldBAvN7uqNgRHEZ2x
JRsIFdT6B13YPFwjAsnSNwl9kIYoRmaXVsFugQELqIbKAKqe/7BGCgnG9qLfN8a0
n6+T3tbpoyWL5MWcDGi6Z+r+GL3bb3GQQQY9GHa4sNU5aWsDcdEOTM9g9KKgINY1
0OIt5nXxPjLEcpOsuqxFA/Xk9kA/EPr/oz4VpZN+9WlahBkL31BJ5Vb3QjbC6eo5
amOAJ+qg04jFu2rLTMBtjunc+/Hvebiz8JsK1Bcb5VeG3GEJKKRTRw==
=2fEH
-----END PGP SIGNATURE-----

Re: the utf8 flag (was Re: decode_utf8 sets utf8 flag on plain ascii strings)
2007-03-30 19:42:58

On Sat, Mar 31, 2007 at 02:04:53AM +0200, Juerd Waalboer <juerd convolution.nl> wrote:
> I've said it on p5p at least a dozen times, but I'll say it again:
>
> If the UTF8 flag is set, you can be sure that you have a text string.

Repeating wrong statements does not make them true.

> If you have a text string, the UTF8 flag may or may not be set. If you have
> a byte string, the UTF8 flag is not set (or it was set because you
> treated the byte string as a text string).

No, please look at my example of JSON.

> > The problem with your approach is that you have to expose the UTF-X flag
> > to users. Which comes with a lot of problems.
>
> Again: you're kidding, right?
>
> I'm constantly very explicitly and verbosely telling people to NOT look
> at the flag, NOT set it manually, etcetera.

So why do you propose that people have to make sure that they never put a
binary string with the UTF-X flag set into unpack?

How are users supposed to do that, unless they know about the flag in the
first place?

No, I am not kidding. You are part of the crowd who wants to expose the
UTF-X flag to the perl level, despite your claims that you do not want to.

> Heck, I've even explained that I think you should try to (pretend to) be
> ignorant about the internals, in response to your message even!

Right, and then you want perl functions to die depending on the setting of
that flag, even though you also claim Perl users should not need to know
about it.

So you tell users when they get that error message that they did something
wrong that they should not care about?

No, I am certainly not kidding.

> I do not understand how you are able to misinterpret this message even
> after this many posts in this thread alone. Have you ever read
> perlunitut, even?

As I said, I have no such manpage, and even if I had, it has nothing to do
with this. I am not misinterpreting your message at all.

You want perl functions to behave differently depending on whether that flag
is set or not. I want perl functions to behave the same, regardless of it.

You expose the UTF-X flag that way. I don't.

You *are* contradicting yourself, but that has nothing to do with whether I
have read that document or not. That alone is your problem.

Either you do expose the UTF-X flag by making perl functions behave
differently, or you don't.

No amount of claiming you do not want to expose it can fix that: You do,
whether you want to or not, if you change Perl semantics to make a difference.

> That's not what I said, nor what I meant. In fact, quite the opposite.

So then unpack should not croak when it sees the UTF-X flag?

> If you're just spending this evening just to get on my nerves, then
> congratulations!

No, I am trying to make you understand the typeless nature of Perl, and
that your proposals expose the UTF-X flag, no matter what you *want*.

You could just understand that for a change, then maybe you wouldn't need to
accuse me of just trying to get on your nerves.

I do understand that you said you do not want to expose that flag. But as
long as the changes you propose do that, it is being exposed.

I am sorry that I can't say it any clearer.

> > Because "internal format" strings can store binary data just as well,
> > and often do.
>
> Yes, and when you use such a byte string as a text string, its bytes are
> considered to be codepoints, just like in latin1.

Yeah, sure. Mind you: no mention of UTF-X.

> > I am talking purely about the perl level strings. If perlunitut confused
> > the issue by talking about internal encoding it completely failed its
> > mission, imho.
>
> I strongly suggest that you READ the document before whining about its
> supposed failure.

Well, I trust that you don't misquote its contents. Did you?

> > The problem is that some parts of perl make a difference between the
> > very same string, depending on how it is encoded internally, _even if
> > the encoding is the same on the Perl level_.
>
> Those are bugs. Report them, and they might get fixed.

I did. That's the whole point of this thread. I reported them a number of
times. How could you miss that?
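
[Editorial illustration] The cases discussed in this thread are unpack and JSON. The sketch below uses a different and commonly cited probe, \w matching and uc() on a character in the 128..255 range, as a stand-in for the same class of problem. It is an assumption that your perl behaves this way: the results depend on the Perl version and on pragmas such as `use feature 'unicode_strings'`, so the script only reports what it observes.

```perl
use strict;
use warnings;

# Deliberately no "use feature 'unicode_strings'" and no /u or /a modifier.
my $native   = "\xe9";
my $upgraded = $native;
utf8::upgrade($upgraded);   # same character, different internal storage

# At the Perl level the two strings compare equal.
my $equal = ($native eq $upgraded) ? 1 : 0;

# Operations that may or may not treat the two copies alike.
my $word_native   = ($native   =~ /\w/) ? 1 : 0;
my $word_upgraded = ($upgraded =~ /\w/) ? 1 : 0;
my $uc_equal      = (uc($native) eq uc($upgraded)) ? 1 : 0;

print "strings equal:        $equal\n";
print "\\w matches native:    $word_native\n";
print "\\w matches upgraded:  $word_upgraded\n";
print "uc() results equal:   $uc_equal\n";
```

If the last three lines do not agree with each other, the operation distinguishes two strings that are equal at the Perl level, which is the kind of behaviour being reported here.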

> > > utf8::encode is a text operation. It will assume that whatever you give
> > > it, is a text string. Its characters are considered Unicode codepoints.
> > Where does it say so?
>
> Well, you have already denied that "encoding is going from characters to
> bytes" is a real world fact, so I guess there's little point in pointing
> out the places where exactly the same thing is explained.

If it is wrong, it's wrong. No matter how often you try to explain
it. People do store octets in UTF-8. Even perl extends UTF-8 to UTF-X
to make interesting usages possible. So yes, if that's broken, then Perl
is already broken, fundamentally, by allowing non-unicode-codepoints in
strings.
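
[Editorial illustration] That Perl strings can hold values outside the Unicode range can be sketched as follows. This is not code from the thread; the value 0x110000 is chosen only because it is one above the Unicode maximum of U+10FFFF, and the `no warnings 'utf8'` line is a precaution in case your perl warns about such values.

```perl
use strict;
use warnings;
no warnings 'utf8';   # precaution against non-Unicode code point warnings

# One above the Unicode maximum of U+10FFFF.
my $beyond = chr(0x110000);

my $codepoint = ord($beyond);
my $chars     = length($beyond);

# Serialise a copy using Perl's extended UTF-8.
my $octets = $beyond;
utf8::encode($octets);
my $octet_count = length($octets);

printf "code point: 0x%X, characters: %d, octets: %d\n",
    $codepoint, $chars, $octet_count;
```

The string holds a single element that is not a Unicode code point, and Perl still stores and serialises it.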

Choose one of two: your claims are wrong, or Perl is wrong. Either way suits
me, although I personally think the current model makes much more sense than
your user-has-to-care-for-UTF-X flag explicitly model.

> > > you need to know some internals.
> > Wrong. I need know no internals
>
> A certain Marc Lehmann once said:
>
> "I would love if that were the case, but the powers to be decided that
> every perl programmer has to know those internals, and needs to be able
> to deal with them."

Yes. Any problems with that?

As you like to quote with misleading context, let me add that the context
was unpack and perl modules using it or XS, not utf8::encode.

You make a classical logical fallacy: just because some parts of Perl do
not force you to know internals this does not mean that all of Perl does
not force you.

> > > That makes no sense, because UTF-8 is a means of representing
> > > characters. Byte strings consist of bytes, not characters.
> > Not in C, which is what the documentation constantly refers to, mind
> > you.
>
> And that is bad, I agree. Perl programmers should not be expected to
> speak C in order to understand Perl documentation. This is a big
> problem in Perl's documentation, but who's going to fix it?

I do not suffer from it. I just want sane behaviour in Perl, which doesn't
force me to think about whether my UTF-X flag could be set and my program
could die because of that, but where I get the correct and expected
results.

--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ pcg goof.com
--==---/ / _ / // / / / http://schmorp.de/
-=====/_/_//_/_,_/ /_/_ XX11-RIPE