(I've subscribed to P5P now, having recently had a fair bit of indirect
interaction with the list.)

I've encountered quite a bit of XS code that ignores the SvUTF8 flag
and operates on the underlying bytes of a PV instead of the characters
being represented. It has been quite a few years since the flag was
added, and it doesn't seem that much progress has been made in using it
properly.

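To illustrate the difference between bytes and characters, here is a
minimal sketch in plain C, not real XS. The function name char_count()
and its is_utf8 parameter are made up for this example; is_utf8 merely
stands in for what SvUTF8 would report, and the input is assumed to be
well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in a buffer of len bytes.
 * is_utf8 plays the role of the SvUTF8 flag (hypothetical parameter):
 * when set, the bytes are UTF-8 and a character is one lead byte plus
 * any continuation bytes; when clear, every byte is one character.
 * Assumes well-formed UTF-8; no validation is attempted. */
static size_t char_count(const unsigned char *p, size_t len, int is_utf8)
{
    size_t i, n = 0;

    assert(p != NULL);
    if (!is_utf8)
        return len;
    for (i = 0; i < len; i++) {
        /* Continuation bytes look like 10xxxxxx; count only the rest. */
        if ((p[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

The same two bytes 0xC3 0xA9 are one character (e-acute) when the flag
is set and two characters when it is clear. Code that looks only at the
bytes cannot tell which is meant.
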
I suggest that this is a long-term problem that needs to be addressed,
and that it should be addressed by this two-part strategy:

0. Make it easier to get at the UTF8 flag when pulling the PV from a
   scalar. Specifically, the SvPV() macro, which returns pointer and
   length, should be supplanted by a similar macro which, in one go,
   returns pointer, length, and encoding flag.

1. Deprecate the macros such as SvPV() that don't return the encoding
   flag. Looking up the old macros in perlapi(1) and other such
   documentation should result in the reader being directed towards
   the new macros, and hence towards UTF-8 awareness.

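As a rough sketch of the shape such an interface could take, the
following is plain C using a mock scalar. Everything in it is
hypothetical: struct mock_sv is not the real SV layout, and MOCK_PVU()
is not the macro from the patch, whose name and argument order may
differ. With today's API the equivalent takes two steps, SvPV() followed
by a separate SvUTF8() check, and the second step is the one that gets
forgotten.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Perl scalar; NOT the real SV layout. */
struct mock_sv {
    const char *pv;   /* string buffer */
    size_t      len;  /* length in bytes */
    int         utf8; /* nonzero if the buffer holds UTF-8 */
};

/* Hypothetical three-result fetch: stores the byte length and the
 * encoding flag in the caller's variables and yields the pointer,
 * so the flag cannot be left behind by accident. */
#define MOCK_PVU(sv, lenvar, utf8var) \
    ((lenvar) = (sv)->len, (utf8var) = (sv)->utf8, (sv)->pv)

/* Example caller: number of bytes occupied by the first character.
 * Assumes well-formed UTF-8 when the flag is set. */
static size_t first_char_len(const struct mock_sv *sv)
{
    size_t len;
    int is_utf8;
    const unsigned char *p =
        (const unsigned char *)MOCK_PVU(sv, len, is_utf8);
    size_t n = 1;

    assert(p != NULL);
    if (len == 0)
        return 0;
    if (!is_utf8)
        return 1;
    /* Step over continuation bytes (10xxxxxx) after the lead byte. */
    while (n < len && (p[n] & 0xC0) == 0x80)
        n++;
    return n;
}
```

Because the caller has to name a variable for the flag, ignoring the
encoding becomes a visible decision rather than an omission.
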
This should directly increase the proportion of new XS code that
handles string scalars correctly. It also provides an easy way to
screen existing code for likelihood of UTF-8 problems: any call to the
UTF8-unaware macros is suspect. This makes the job of revising all the
existing problematic code easier.

I think that if this is done then the initial stage, defining the new
macro interfaces, should be done before 5.10.0 is released, to minimise
compatibility issues for people using the new macros. The mass updates
of modules will take years, so we might as well let it start now.

The attached patch adds an SvPVu() macro to supplant SvPV(), and
similarly modified versions of most other SvPV*() macros and related
functions. It doesn't add any uses of the new interfaces, other than
the internal ones in the implementation of the new interfaces.

-zefram