Thread: encouraging UTF-8 awareness




encouraging UTF-8 awareness
user name
2007-10-09 09:03:46
(I've subscribed to P5P now, having recently had a fair bit of indirect
interaction with the list.)

I've encountered quite a bit of XS code that ignores the SvUTF8 flag
and operates on the underlying bytes of a PV instead of the characters
being represented.  It's quite a few years since the flag was added, and
it doesn't seem like much progress has been made in the proper use of it.
I suggest that this is a long-term problem that needs to be addressed,
and that it should be addressed by this two-part strategy:

0. Make it easier to get at the UTF8 flag when pulling the PV from a
   scalar.  Specifically, the SvPV() macro, which returns pointer and
   length, should be supplanted by a similar macro which, in one go,
   returns pointer, length, and encoding flag.

1. Deprecate the macros such as SvPV() that don't return the encoding
   flag.  Looking up the old macros in perlapi(1) and other such
   documentation should result in the reader being directed towards the
   new macros, and hence towards UTF-8 awareness.
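
As a rough illustration of point 0, here is a standalone C model of the
two access styles.  It is only a sketch: it does not use the real Perl
headers, and struct model_sv, model_pv(), model_pv_flag() and
model_char_count() are hypothetical names invented for this example.
The actual signature of SvPVu() in the attached patch is not shown in
this thread, so nothing below should be read as that interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Perl string scalar; NOT the real SV layout. */
struct model_sv {
    const char *pv;   /* underlying byte buffer */
    size_t      len;  /* length of the buffer in bytes */
    bool        utf8; /* models the SvUTF8 flag */
};

/* Old-style access: pointer and length only, so the flag is easy to forget. */
const char *model_pv(const struct model_sv *sv, size_t *len)
{
    assert(sv != NULL && len != NULL);
    *len = sv->len;
    return sv->pv;
}

/* Proposed-style access: pointer, length, and encoding flag in one go. */
const char *model_pv_flag(const struct model_sv *sv, size_t *len, bool *is_utf8)
{
    assert(sv != NULL && len != NULL && is_utf8 != NULL);
    *len = sv->len;
    *is_utf8 = sv->utf8;
    return sv->pv;
}

/* A consumer that respects the flag: counts characters, not bytes.
 * Assumes the buffer is well-formed UTF-8 when the flag is set. */
size_t model_char_count(const struct model_sv *sv)
{
    size_t len, i, chars = 0;
    bool is_utf8;
    const unsigned char *p =
        (const unsigned char *)model_pv_flag(sv, &len, &is_utf8);

    if (!is_utf8)
        return len;                 /* one byte per character */
    for (i = 0; i < len; i++)
        if ((p[i] & 0xC0) != 0x80)  /* skip UTF-8 continuation bytes */
            chars++;
    return chars;
}
```

With this model, the character U+00E9 stored as the single byte 0xE9
(flag clear) and stored as the two bytes 0xC3 0xA9 (flag set) both count
as one character, whereas code that looked only at the length from
model_pv() would see lengths 1 and 2.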

This should directly increase the proportion of new XS code that
handles string scalars correctly.  It also provides an easy way to
screen existing code for likelihood of UTF-8 problems: any call to the
UTF8-unaware macros is suspect.  This makes the job of revising all the
existing problematic code easier.

I think that if this is done then the initial stage, defining the new
macro interfaces, should be done before 5.10.0 is released, to minimise
compatibility issues for people using the new macros.  The mass updates
of modules will take years, so we might as well let it start now.

Attached patch adds an SvPVu() macro to supplant SvPV(), and similarly
modified versions of most other SvPV*() macros and related functions.
Doesn't add any uses of the new interfaces, other than the internal ones
in the implementation of the new interfaces.

-zefram

  
Re: encouraging UTF-8 awareness
user name
2007-10-10 04:33:20
Rafael Garcia-Suarez wrote:
>This approach needs to be discussed with all the UTF-8 fixing and
>backwards-compatibility-breaking that is planned for 5.12.

Ah, I didn't know about that.  Where can I read up on it?

>SvPV don't always store characters, they might store bytes, for which
>the encoding isn't relevant.

Huh?  In Perl characters and bytes are aliased; Encode(3) explicitly
defines "byte" as a subrange of characters.  Bytes, alias Latin-1
characters, certainly can be represented in a Perl scalar encoded in UTF-8
with the SvUTF8 flag set.  If I had such a scalar, I would not appreciate
a module ignoring the encoding flag.  I have in fact encountered such
scalars unintentionally, which is what prompted me to fix my XS crypto
modules (which operate strictly on byte strings) to pay attention to
the encoding flag.
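
To make the byte-string case concrete, here is a standalone C sketch of
what "paying attention to the encoding flag" can mean for code that
wants bytes: if the buffer is UTF-8 encoded, decode it back to one byte
per character, and refuse if any character is above 0xFF.  In real XS
code the existing SvPVbyte() macro and sv_utf8_downgrade() function
serve this purpose; the function model_utf8_to_bytes() below is a
hypothetical name used only for this illustration, and it works on a
plain buffer rather than on an SV.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a UTF-8 buffer whose characters are all <= 0xFF into raw bytes.
 * 'out' must have room for at least in_len bytes.
 * Returns 0 on success and stores the byte count in *out_len;
 * returns -1 if a character above 0xFF or a malformed sequence is found. */
int model_utf8_to_bytes(const unsigned char *in, size_t in_len,
                        unsigned char *out, size_t *out_len)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL && out_len != NULL);
    while (i < in_len) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII: one byte, unchanged */
            out[o++] = c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < in_len
                   && (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF */
            out[o++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            /* Wide character or malformed input: not a byte string */
            return -1;
        }
    }
    *out_len = o;
    return 0;
}
```

A module that operates strictly on byte strings would apply this kind
of conversion only when the flag is set, and use the buffer directly
otherwise.  Failing loudly on characters above 0xFF, rather than
silently processing the UTF-8 encoding, is the design choice assumed
here.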

-zefram

Re: encouraging UTF-8 awareness
user name
2007-10-10 04:19:51
On 09/10/2007, Zefram <zeframfysh.org> wrote:
> I've encountered quite a bit of XS code that ignores the SvUTF8 flag
> and operates on the underlying bytes of a PV instead of the characters
> being represented.  It's quite a few years since the flag was added, and
> it doesn't seem like much progress has been made in the proper use of it.
> I suggest that this is a long-term problem that needs to be addressed,
> and that it should be addressed by this two-part strategy:

This approach needs to be discussed with all the UTF-8 fixing and
backwards-compatibility-breaking that is planned for 5.12. I feel it's
too late at this point in 5.10 (almost code frozen) to get new macros
in, esp. since it can be added later with Devel::PPPort.

> 0. Make it easier to get at the UTF8 flag when pulling the PV from a
>    scalar.  Specifically, the SvPV() macro, which returns pointer and
>    length, should be supplanted by a similar macro which, in one go,
>    returns pointer, length, and encoding flag.
>
> 1. Deprecate the macros such as SvPV() that don't return the encoding
>    flag.  Looking up the old macros in perlapi(1) and other such
>    documentation should result in the reader being directed towards the
>    new macros, and hence towards UTF-8 awareness.

SvPV don't always store characters, they might store bytes, for which
the encoding isn't relevant.
