Thread: proposed change in utf8 filename semantics

proposed change in utf8 filename semantics
2007-09-18 15:03:50
Hello,
For some time I was wondering how it would be optimal to
incorporate utf8 in
filenames in perl. The problem, as it seems to me, is that
unicode support in
this regard is not orthogonal in Perl, because things like
locales and files IO
can be easily managed by 'use locale' and IO layers, whereas
unicode characters
in file names are left out. For unix this was never a
problem, because
there is no special syntax for filenames in unicode, but for
win32 it is, and
working with files that contain unicode letters outside
current locale is a
real trouble.
I would therefore like to propose a (non-default) change in
semantics, that
will use OS-level unicode API (for win32, wide-char API)
when available, and if
explicitly asked. The semantics has two aspects:
1. When a filename-related function is called with a
filename scalar that has
SvUTF8 bit set, the function will try to use OS-level
unicode API -- if
present. On win32, functions like win32_stat() will check
the utf8 context
hints, and depending on the value, will call either stat()
or wstat(). For OSes
where no special API is present, no changes in the code are
needed, and no
additional runtime expenses are incurred.
2. Functions that return file names, like readdir(), are
taught to
differentiate between bytes and utf8 context, regardless of
whether OS supports
unicode API or not. I propose to extend syntax of binmode so
that two new
calls
binmode( DIRHANDLE, ':utf8')
binmode( DIRHANDLE, ':bytes')
will be recognized, and depending on the last such call,
readdir() will return
filenames either with or without SvUTF8 flag on. Again, OS
unicode API will be
used where supported, and where it is not, no additional
code is required. In
':utf8' mode, all results of PerlIO_readdir() will be simply
flagged with
SvUTF8, and the validity of utf8 string can be later checked
with utf8::valid,
if necessary.
I'm attaching a patch against 5.10.0 that implements this
new behavior for
stat(), opendir(), and readdir() only. I'm unsure whether
this patch would be
considered good enough for inclusion, so I don't want to
spend more time on
implementing all filename-related functions yet. OTOH if
someone would want to
help me with the implementation, that would be really
great.
The patch is split in two sections, one for the code and
another for the
configuration files. The code patch concerns only .c and .h
files, and is
fairly complete. I'm unsure though about the configuration
patch - it applies
changes to Configure and win32/config.*, but there are many
more pre-compiled
config templates for other platforms, so I didn't touch
these, and would like
to ask someone to tell me what I missed (I basically need
to add a new config
variable utf8filenamesemantics).
Of course I'm completely unsure if the idea with the new
utf8 filename
semantics will be accepted at all. I understand that it is
win32 users that
will benefit most from it, because on unix simple
'Encode::_utf8_on($_) for
readdir' is all that is needed to treat filenames as
unicode. Nevertheless,
if accepted, there will be a little more uniformity in
Perl's cross-platform
filename and unicode handling.
This is my first Perl patch, so if I broke some rules here,
please don't just
stay silent, tell me what can be done better. I tested it on
win32, freebsd,
and linux, and it seems to work as expected; I don't know
what else I should
test. Please review and/or test it too. To enable it for
win32, define
UTF8_FILENAME_SEMANTICS in win32/Makefile, otherwise re-run
Configure and
answer yes on 'Perl can be built with experimental UTF8
filename semantics
enabled' question.
--
Sincerely,
Dmitry Karasik

Re: proposed change in utf8 filename semantics
2007-09-18 18:58:37
One big problem with filenames and encodings is that it is
incredibly
platform dependent. And here, "platform" includes
mounted filesystem!
/foo may expect encoding A, whereas /foo/bar wants B. This
can result in
a single path of /foo/bar/baz requiring that "foo"
be encoded as latin1,
"bar" as A, and "baz" as B.
Unless perl can -somehow- tell (or be told) which encoding
is required,
there's really no way to get any cross platform
compatibility in this
area.
And note that while mixed filesystem encodings may only
occur
occasionally in real systems, there might still be the case
of dealing
with user preference, where the MP3 collection is UTF-8, but
the photo
album is strictly ISO-8859-1 for compatibility with some old
program
that adds captions to the images.
> 1. When a filename-related function is called with a
filename scalar
> that has SvUTF8 bit set, the function will try to use
OS-level unicode
> API -- if present.
No. The SvUTF8 bit indicates that the internal encoding of
the string is
UTF8 rather than ISO-8859-1. (Note that ISO-8859-1 is
-officially- a
Unicode encoding too, so Unicode semantics ought to apply.)
Perl already uses the UTF8 flag to decide *semantics* in
several places.
While from a historical perspective this may have made
sense, it is a
huge mistake that causes a lot of pain and subtle
hard-to-catch bugs.
Do not use the UTF8 flag to determine if you're going to use
Unicode
semantics or not, in new code. Use something that is visible
in Perl
code, instead of some internal variable. For example, a
pragma.
Support unicode always or never, or let the user decide.
Please do not
apply heuristics here.
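
A minimal sketch of why the flag is a weak signal, using only the core utf8:: functions (the helper name flag_report is made up for illustration). Two scalars that compare equal can still differ in their internal UTF8 flag, so a filename function that dispatched on the flag could treat the same name in two different ways:

```perl
use strict;
use warnings;

# Report whether two strings are equal and how each is stored internally.
# flag_report is a hypothetical helper, not part of Perl or of the patch.
sub flag_report {
    my ($latin, $wide) = @_;
    return {
        equal      => ($latin eq $wide)     ? 1 : 0,
        latin_flag => utf8::is_utf8($latin) ? 1 : 0,
        wide_flag  => utf8::is_utf8($wide)  ? 1 : 0,
    };
}

my $latin = "\xE9";      # U+00E9, stored internally as a single byte
my $wide  = "\xE9";
utf8::upgrade($wide);    # same character, now stored internally as UTF-8

my $r = flag_report($latin, $wide);
printf "equal=%d latin_flag=%d wide_flag=%d\n",
    @{$r}{qw(equal latin_flag wide_flag)};
```

Which of the two representations a given scalar ends up with depends on how it was built, which is not something the Perl-level programmer normally sees.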
> 2. Functions that return file names, like readdir(),
are taught to
> differentiate between bytes and utf8 context,
regardless of whether OS
> supports unicode API or not.
Instead of "bytes" and "utf8", please
let's make that "binary" and
"text", or "bytes" and
"characters", because UTF-8 sequences are also
bytes.
> I propose to extend syntax of binmode so that two new
calls
> binmode( DIRHANDLE, ':utf8')
> binmode( DIRHANDLE, ':bytes')
> will be recognized, and depending on the last such
call, readdir()
> will return filenames either with or without SvUTF8
flag on.
This does not scale to functions like glob and open that
don't act
on a DIRHANDLE, but do access directories. When you open
/foo/bar/baz/quux, each part can have its own expected
encoding, so you
need to be able to set different encodings for /foo,
/foo/bar,
/foo/bar/baz, and /foo/bar/baz/quux.
I think a (non-lexical) pragma or special variable that
enables encoding
support (not just UTF-8!) for the filesystem would be a
better idea.
When enabled, Perl tries to auto-detect the encoding, with
the ability
to override this by explicitly saying that things under
"/foo/bar/"
should be encoding B and everything under
"/mnt/tmp5" should be encoding
C. Of course, there should also be a way to say that even
though
"/foo/bar"'s tree was forced to B, everything
under "/foo/bar/baz/quux"
should use auto detection again.
> all results of PerlIO_readdir() will be simply flagged
with SvUTF8,
> and the validity of utf8 string can be later checked
with utf8::valid,
> if necessary.
That's a scary and potentially dangerous approach. SvUTF8 is
treated as
a promise that says "this buffer is valid UTF8".
This is why
:encoding(UTF-8) is often a better choice than :utf8. In
fact, I'm still
pissed off by the poor huffman coding here.
> I'm attaching a patch against 5.10.0 that implements
this new behavior for
> stat(), opendir(), and readdir() only. I'm unsure
whether this patch would be
> considered good enough for inclusion
I love it when people send patches. It shows that what they
want can
actually be done, and that they're willing to spend time to
make it
happen.
However, this is a new feature with potentially major
(probably
positive) impact. It lacks documentation and there's
practically no time
to test it. I like the idea of having Perl support
non-raw-bytes
filenames, but let's first find consensus about what the
proper level of
abstraction should be. I'm not the pumpking, of course, but
I'd like
this to go into 5.12, not 5.10.
> (I basically need to add a new config variable
utf8filenamesemantics).
There's more than just UTF-8. It would be nice if the
implementation
went all the way and implemented a framework for other
encodings too.
> simple 'Encode::_utf8_on($_) for readdir' is all that is
needed to
> treat filenames as unicode.
Simple but dangerous, and there's no way of knowing that
what readdir
returns should actually be interpreted as UTF-8. (AFAIK.)
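
For comparison, a hedged sketch of a checked alternative to Encode::_utf8_on, assuming the caller already knows the directory is supposed to hold UTF-8 names. The helper name decode_filename is invented for this example; Encode::decode with FB_CROAK is the standard way to refuse malformed input instead of silently flagging it:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode a raw directory entry as UTF-8.
# Returns the character string, or undef if the bytes are not well-formed.
sub decode_filename {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    my $chars = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $chars;
}

for my $name ("caf\xC3\xA9.txt", "caf\xE9.txt") {
    my $decoded = decode_filename($name);
    print defined $decoded ? "accepted\n" : "rejected\n";
}
```

This still does not answer the question of whether the name was meant to be UTF-8 in the first place; it only guarantees that a flagged string is well-formed.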
--
Met vriendelijke groet, Kind regards, Korajn salutojn,
Juerd Waalboer: Perl hacker <##### juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy
<sales convolution.nl>

RE: proposed change in utf8 filename semantics
2007-09-18 19:57:13
On Tue, 18 Sep 2007, Dmitry Karasik wrote:
> For some time I was wondering how it would be optimal
to incorporate
> utf8 in filenames in perl. The problem, as it seems to
me, is that
> unicode support in this regard is not orthogonal in
Perl, because
> things like locales and files IO can be easily managed
by 'use locale'
> and IO layers, whereas unicode characters in file names
are left out.
> For unix this was never a problem, because there is no
special syntax
> for filenames in unicode, but for win32 it is, and
working with files
> that contain unicode letters outside current locale is
a real trouble.
>
> I would therefore like to propose a (non-default)
change in semantics,
> that will use OS-level unicode API (for win32,
wide-char API) when
> available, and if explicitly asked. The semantics has
two aspects:
I think this is the wrong approach. This topic has been
discussed a couple
of times, both here and on the perl-unicode mailing list.
The consensus
seems to be the approach described in pod/perltodo.pod under
the heading
"Virtualize operating system access".
I'm interested in discussing this further, but I'm going to
be offline
from sometime next week until the end of October. However,
I think this
is a topic for Perl 5.12, so there is no urgency right now.
Note that I added various workarounds to both Perl 5.10 and
the included
Win32 module to make it possible to work with Unicode
filenames that
cannot be mapped to the ANSI codepage:
Whenever readdir() or glob() have to return a filename that
cannot be
mapped back to the system codepage without substitution
characters, then
they will return the short 8.3 name instead. As long as you
are using
the NTFS filesystem, this name can always be represented in
the ANSI
codepage, and therefore be passed back to open() or passed
to other
programs etc. (If you are using FAT, then the 8.3 filename
may contain
characters from the OEM character set and may still not be
representable
in the ANSI codepage).
If you need the long version of a filename returned by
readdir() or glob()
then you can always call Win32::GetLongPathName(), which
will return
the full name of the file or directory, switching to UTF8 if
the string
cannot be represented in the ANSI codepage.
So while accessing non-ANSI filenames from Perl isn't
exactly easy, it
is certainly already possible, and mostly seamless, as long
as you are
using NTFS. Please check out the other filename related
functions in the
Win32.pm module and let me know what you think.
Just remember that using the 8.3 filenames is meant as a
workaround until
the "virtual operating system access" is properly
implemented. The basic
system is already in place for Win32; the problem is just
that we continue
to use char* pointers to pass strings to OS calls, so we
lose the string
encoding in the process.
Cheers,
-Jan

RE: proposed change in utf8 filename semantics
2007-09-18 20:12:17
On Tue, 18 Sep 2007, Juerd Waalboer wrote:
> One big problem with filenames and encodings is that it
is incredibly
> platform dependent. And here, "platform"
includes mounted filesystem!
>
> /foo may expect encoding A, whereas /foo/bar wants B.
This can result in
> a single path of /foo/bar/baz requiring that
"foo" be encoded as latin1,
> "bar" as A, and "baz" as B.
>
> Unless perl can -somehow- tell (or be told) which
encoding is required,
> there's really no way to get any cross platform
compatibility in this
> area.
>
> And note that while mixed filesystem encodings may only
occur
> occasionally in real systems, there might still be the
case of dealing
> with user preference, where the MP3 collection is
UTF-8, but the photo
> album is strictly ISO-8859-1 for compatibility with
some old program
> that adds captions to the images.
On Windows you can just call the wide-character APIs and the
OS / file
system drivers make sure that each part of the filename is
encoded
correctly for the filesystem on which it is stored.
The whole issue only becomes messy when you have to use the
byte string
API, when you suddenly have to deal with encodings that can only
represent a
subset of the full Unicode character set. The real problem
of course is
that typical Unix systems don't have a wide-character API
that hides
the implementation details from the user.
Cheers,
-Jan

Re: proposed change in utf8 filename semantics
2007-09-18 22:44:42
Dmitry Karasik wrote:
> Hello,
>
> For some time I was wondering how it would be optimal
to incorporate utf8 in
> filenames in perl. The problem, as it seems to me, is
that unicode support in
> this regard is not orthogonal in Perl, because things
like locales and files IO
> can be easily managed by 'use locale' and IO layers,
whereas unicode characters
> in file names are left out. For unix this was never a
problem, because
> there is no special syntax for filenames in unicode,
but for win32 it is, and
> working with files that contain unicode letters outside
current locale is a
> real trouble.
There may be bigger problems than you realize.
What happens on NTFS if you store a filename using UTF8
encoding through
the traditional UNIX calls?
The VMS ODS-5 file system supports Unicode in two
encodings.
VTF-7, which is a special encoding of UCS-2 into ASCII-hex
characters
preceded by a special ASCII caret and the capital U.
And it also supports almost pure binary characters in
filenames with
only a few exceptions. The binary characters are encoded in
HEX
preceded by an ASCII caret.
Both of these are available using the native file system
syntax.
The C library routines on VMS can handle either native
syntax or UNIX
syntax. UTF-8 names in UNIX syntax are automatically
converted to the
binary native syntax and back by the C library.
There are no wide character APIs in the VMS C library.
Now here the problem really gets bad. There are programs on
VMS that
only understand how to represent Unicode filenames in the
native
language using VTF-7 format.
Filenames encoded in the UTF-8 (binary) format are not
usable to those
programs.
Filenames encoded in VTF-7 format are not visible to
programs expecting
UNIX file syntax.
> I would therefore like to propose a (non-default)
change in semantics, that
> will use OS-level unicode API (for win32, wide-char
API) when available, and if
> explicitly asked. The semantics has two aspects:
> 1. When a filename-related function is called with a
filename scalar that has
> SvUTF8 bit set, the function will try to use OS-level
unicode API -- if
> present. On win32, functions like win32_stat() will
check the utf8 context
> hints, and depending on the value, will call either
stat() or wstat(). For OSes
> where no special API is present, no changes in the code
is needed, and no
> additional runtime expenses are incurred.
Which Unicode API should VMS use when the SvUTF8 is set?
Native, UTF-8
UNIX, or VTF-7?
And strings can turn into file specifications without
SvUTF8 being set.
> 2. Functions that return file names, like readdir(),
are taught to
> differentiate between bytes and utf8 context,
regardless of whether OS supports
> unicode API or not. I propose to extend syntax of
binmode so that two new
> calls
>
> binmode( DIRHANDLE, ':utf8')
> binmode( DIRHANDLE, ':bytes')
I can easily tell if I do a readdir() if the filename it is
reading is
VTF-7 encoded or not. I do not know for sure if it is utf-8
encoded.
Both may be present in the same directory.
> will be recognized, and depending on the last such
call, readdir() will return
> filenames either with or without SvUTF8 flag on. Again,
OS unicode API will be
> used where supported, and where it is not, no
additional code is required. In
> ':utf8' mode, all results of PerlIO_readdir() will be
simply flagged with
> SvUTF8, and the validity of utf8 string can be later
checked with utf8::valid,
> if necessary.
Again, the SvUTF8 flag can tell me that I can expect UTF-8
sequences,
but in NATIVE mode, a VMS file specification has either
Unicode
representation encoded in ASCII.
> I'm attaching a patch against 5.10.0 that implements
this new behavior for
> stat(), opendir(), and readdir() only. I'm unsure
whether this patch would be
> considered good enough for inclusion, so I don't want
to spend more time on
> implementing all filename-related functions yet. OTOH
if someone would want to
> help me with the implementation, that would be really
great.
First see if your platform will accept UTF-8 encoded
filenames in UNIX
syntax as different files from other Unicode encoded.
> The patch is split in two sections, one for the code
and another for the
> configuration files. The code patch concerns only .c
and .h files, and is
> fairly complete. I'm unsure though about the
configuration patch - it applies
> changes to Configure and win32/config.*, but there are
many more pre-compiled
> config templates for other platforms, so I didn't touch
these, and would like
> to ask someone to tell me what did I miss ( I basically
need to add a new config
> variable utf8filenamesemantics).
I do not like build options; is there any way to make it a
run time
setting like a mode or a pragma?
> Of course I'm completely unsure if the idea with the
new utf8 filename
> semantics will be accepted at all. I understand that it
is win32 users that
> will benefit most from it, because on unix simple
'Encode::_utf8_on($_) for
> readdir' is all that is needed to treat filenames as
unicode. Nevertheless,
> if accepted, there will be some little more uniformity
in Perl's cross-platform
> filename and unicode handling.
I think that the issue of filename handling for non-UNIX
platforms could
use some improvements.
I also think that VMS and Win32 may have some issues in
common that need
to be resolved.
It may be more practical to first build an external overload
to the
filename functions to do the translations based on an object
type of a
file specification, and the properties of that object.
This way perl modules can use class of a file specification
and its
properties as an enhancement to the base perl, and it would
be clear
that the object in question is a file specification.
> This is my first Perl patch, so if I broke some rules
here, please don't just
> stay silent, tell me what can be done better. I tested
it on win32 and freebsd
> and linux, seems to be working as expected, I don't
know what else should I
> test. Please review and/or test it too. To enable it
for win32, define
> UTF8_FILENAME_SEMANTICS in win32/Makefile, otherwise
re-run Configure and
> answer yes on 'Perl can be built with experimental UTF8
filename semantics
> enabled' question.
For VMS, what would be needed is a wrapper around any UNIX
routine that
operates on a filename that could translate the filename
from UNIX
format to Native if needed, and then it would need options
as to how to
translate it.
This is why I like the idea of a class for handling file
specifications,
Unicode or otherwise.
Even more fun would be if someone needed to write a Perl
script to
rename UTF-8 encoded names to VTF-7 encoded names or the
reverse. It
might be to maintain hard links between the two encodings.
VMS also has a related issue in that filenames may need to
be translated
to native format for spawned shell commands. And these
names need to
be less than 255 characters long, even if the Unix syntax
name is longer.
It seems to me that what is needed is a set of APIs that can
convert from
UNIX syntax to the native syntax and back, with a defined
behavior of
what to do when the conversion is not possible.
Something like File::OPS->to_UNIX,
File::OPS->to_native and
File::OPS->to_native_short and their equivalent for
calling from C
programs might eliminate some of the OS specific code.
These routines may need additional hints to indicate if they
are working
on directories, and if the file should exist, because if the
file
exists, it may need to check all possible encodings of the
name. For
instance, the name 'foo.bar.baz' may show up on VMS as
'foo.bar.baz',
'foo^.bar^.baz', 'foo_bar.baz', 'foo.bar_baz',
'foo__2Ebar.baz' or
'foo.bar__2Ebaz' depending on what utility placed the file
there.
Right now, I am working on getting the VMS ODS-5 handling
working right
in Perl, and that includes the various ways that Unicode can
be put in.
-John
wb8tyw qsl.net
Personal Opinion Only

Re: proposed change in utf8 filename semantics
2007-09-19 02:13:16
Hi Juerd!
Juerd> Unless perl can -somehow- tell (or be told) which
encoding is
Juerd> required, there's really no way to get any cross
platform
Juerd> compatibility in this area.
Of course. The idea is that it is the caller that tells perl
which
encoding is required, so no heuristics are necessary.
readdir() would
therefore return unicode filenames only after being told to
do so.
Juerd> And note that while mixed filesystem encodings
may only occur
Juerd> occasionally in real systems, there might still
be the case of
Juerd> dealing with user preference, where the MP3
collection is UTF-8,
Juerd> but the photo album is strictly ISO-8859-1 for
compatibility with
Juerd> some old program that adds captions to the
images.
If we're talking about unix mounts, that is a non-issue. For
win32
mounts, I don't know, I never encountered them at all, so I
don't know
if win32 API takes care of the underlying encoding
translations.
Juerd> Perl already uses the UTF8 flag to decide
*semantics* in several
Juerd> places. While from a historical perspective this
may have made
Juerd> sense, it is a huge mistake that causes a lot of
pain and subtle
Juerd> hard-to-catch bugs.
Hm. I was unaware of a point of view that the UTF8 flag was
a mistake,
so of course from that point of view the whole proposition
would be a
continuation of that mistake, simply put.
Juerd> Do not use the UTF8 flag to determine if you're
going to use
Juerd> Unicode semantics or not, in new code. Use
something that is
Juerd> visible in Perl code, instead of some internal
variable. For
Juerd> example, a pragma.
I tend to agree; however, pragmas tend to be global,
program- or package-wise, and what suits best here is an
individual, per-call flag.
Juerd> Support unicode always or never, or let the user
decide. Please do
Juerd> not apply heuristics here.
I must've written something unclear. There are no heuristics,
and it is
the user who decides which semantics to use.
Juerd> Instead of "bytes" and "utf8", please let's make that
Juerd> "binary" and "text", or "bytes" and "characters",
Juerd> because UTF-8 sequences are also bytes.
I personally don't really care what names would be, I just
thought that
it would better fit with the existing IO layer names, with
binmode(FILE, ':utf8') and the like.
Juerd> This does not scale to functions like glob and
open that don't act
Juerd> on a DIRHANDLE, but do access directories. When
you open
Juerd> /foo/bar/baz/quux, each part can have its own
expected encoding, so
Juerd> you need to be able to set different encodings
for /foo, /foo/bar,
Juerd> /foo/bar/baz, and /foo/bar/baz/quux.
This is true for glob, but untrue for open, -- the latter
does not return
filenames.
>> all results of PerlIO_readdir() will be simply
flagged with SvUTF8, and
>> the validity of utf8 string can be later checked
with utf8::valid, if
>> necessary.
Juerd> That's a scary and potentially dangerous
approach. SvUTF8 is
Juerd> treated as a promise that says "this buffer
is valid UTF8". This is
Juerd> why :encoding(UTF-8) is often a better choice
than :utf8. In fact,
Juerd> I'm still pissed off by the poor huffman coding
here.
This is also a bit unclear to me. utf8::valid happily
returns true
when SvUTF8 is off, and only when it is on it does the
actual validity
check. It would be trivial to enforce the promise that
scalars that are
flagged with SvUTF8 are really valid, however, the proposed
behavior is
also based on behavior of the utf8 IO layer, that simply
flags all input
with SvUTF8, valid or not. So this behavior is debatable.
Juerd> Simple but dangerous, and there's no way of
knowing that what
Juerd> readdir returns should actually be interpreted as
UTF-8. (AFAIK.)
True, but again, let's look at the utf8 IO layer: whoever uses
that, accepts the
burden of checking the validity of input. Same is for
readdir.
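
The utf8::valid behaviour described above can be checked with a short sketch (the variable names are illustrative only). The same malformed bytes pass while the flag is off and fail once Encode::_utf8_on has set it without checking anything:

```perl
use strict;
use warnings;
use Encode ();

# Latin-1 bytes for "cafes" with an e-acute: not well-formed UTF-8,
# because 0xE9 starts a multi-byte sequence that "s" cannot continue.
my $bytes = "caf\xE9s";

# Flag off: utf8::valid has nothing to check and reports true.
my $unflagged_ok = utf8::valid($bytes) ? 1 : 0;

# Flag forced on, as a bare ':utf8'-style readdir would do.
my $flagged = $bytes;
Encode::_utf8_on($flagged);
my $flagged_ok = utf8::valid($flagged) ? 1 : 0;

print "unflagged valid=$unflagged_ok, flagged valid=$flagged_ok\n";
```

So under the proposed semantics the validity check would have to run after the flag is set, and the caller would have to remember to do it.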
--
Sincerely,
Dmitry Karasik

Re: proposed change in utf8 filename semantics
2007-09-19 02:25:55
Jan> I'm interested in discussing this further, but I'm
going to be
Jan> offline from sometime next week until the end of
October. However, I
Jan> think this is a topic for Perl 5.12, so there is no
urgency right
Jan> now.
Sure thing. That's why there's only a minimal patch.
Jan> Whenever readdir() or glob() have to return a
filename that cannot be
Jan> mapped back to the system codepage without
substitution characters,
Jan> then they will return the short 8.3 name instead.
I'm aware of that, but note that filenames do not
necessarily
reach a perl program through readdir and glob. If, for example, a user
types
in a filename that contains unmappable characters, open()
wouldn't be able
to open that file.
Jan> So while accessing non-ANSI filenames from Perl
isn't exactly easy,
Jan> it is certainly already possible, and mostly
seamless, as long as you
Jan> are using NTFS. Please check out the other filename
related functions
Jan> in the Win32.pm module and let me know what you
think.
I'm also very much aware of Win32 wide filename support, but
my point is that if a
good abstraction of unicode in filenames is found (and I
hope that mine is
good), then all that wide filename support can be moved to
core.
Jan> Just remember that using the 8.3 filenames is meant
as a workaround
Jan> until the "virtual operating system
access" is properly implemented.
I have no opinion about "virtual operating system
access", and especially
on when it will be implemented, but when it is, there still
must be
change in perl-level semantics anyway. Let's test that
semantics on the
proposed implementation now, and if it is good, it can be
taken as a point
of reference when implementing virtual OS access.
Jan> The basic system is already in place for Win32; the
problem is just
Jan> that we continue to use char* pointers to pass
strings to OS calls,
Jan> so we lose the string encoding in the process.
Not necessarily -- if we adopt a set of flags passed in
PL_dir_unicode, then
we're just fine with char* pointers.
--
Sincerely,
Dmitry Karasik

Re: proposed change in utf8 filename semantics
2007-09-19 02:58:56
Hi John!
John> What happens on NTFS if you store a filename using
UTF8 encoding
John> through the traditional UNIX calls?
The default behavior stays the same -- filenames will be
stored using
byte semantics. If, however, a filename scalar has SvUTF8
flag set,
UTF8 semantics will be used.
John> Now here the problem really gets bad. There are
programs on VMS that
John> only understand how to represent Unicode filenames
in the native
John> language using VTF-7 format.
John> Filenames encoded in the UTF-8 (binary) format are
not usable to
John> those programs.
John> Filenames encoded in VTF-7 format are not visible
to programs
John> expecting UNIX file syntax.
I have very little knowledge about VMS, but based on what
you say, there
is no way to resolve the conflict, as presented.
John> Which Unicode API should VMS use when the SvUTF8
is set? Native,
John> UTF-8 UNIX, or VTF-7?
I wish I would be able to answer your question, because VMS
support will
be vital for my proposition, however as I'm not competent
here, I simply don't
know, sorry.
John> I can easily tell if I do a readdir() if the
filename it is reading
John> is VTF-7 encoded or not. I do not know for sure
if it is utf-8
John> encoded.
But the same goes for default unix semantics -- results from
readdir() can be
easily analyzed to see whether they are utf8 or not. The idea is
not to analyze
the output at all, but rather to let the users decide what
encoding they
want their filenames in.
John> First see if your platform will accept UTF-8
encoded filenames in
John> UNIX syntax as different files from other Unicode
encoded.
Possibly. Again, this is VMS-related, so I don't know.
John> I do not like build options, is there any way to
make it a run time
John> setting like a mode or a pragma.
Me neither, and I guess there should be no problem adding
such a pragma.
John> It may be more practical to first build an
external overload to the
John> filename functions to do the translations based on
an object type of
John> a file specification, and the properties of that
object.
John> This way perl modules can use class of a file
specification and its
John> properties as an enhancement to the base perl, and
it would be clear
John> that the object in question is a file
specification.
This is interesting. Possibly I'm doing a premature
optimization, because
the actual changes required in system-independent files are
minor, so I
thought that direct changes to core that don't change
anything on unix
would be good enough.
John> Even more fun would be if someone needed to write
a Perl script to
John> rename UTF-8 encoded names to VTF-7 encoded names
or the reverse.
John> It might be to maintain hard links between the two
encodings.
Would it be ( excuse my VMS ignorance ) more appropriate to
treat SvUTF8
flag as indication of which layer to use? So SvUTF8-flagged
scalars that
contain characters > 0x7f will be put through VTF-7
layer, and unix-layer
otherwise? But I see that just one SvUTF8 flag might not be
enough here.
--
Sincerely,
Dmitry Karasik

Re: proposed change in utf8 filename semantics
2007-09-19 05:16:38
Dmitry Karasik skribis 2007-09-19 9:13 (+0200):
> Juerd> Perl already uses the UTF8 flag to decide
*semantics* in several
> Juerd> places. While from a historical perspective
this may have made
> Juerd> sense, it is a huge mistake that causes a
lot of pain and subtle
> Juerd> hard-to-catch bugs.
> Hm. I was unaware of a point of view that the UTF8 flag
was a mistake,
The UTF8 flag itself was not a mistake, but using it as a heuristic
for
semantics was. In essence, the UTF8 flag indicates whether the
string is
internally raw bytes or UTF8 encoded. A raw byte string is
interpreted
as ISO-8859-1 whenever it needs to be upgraded. However,
with lc, uc,
//i, and character classes, a negative UTF8 indication
results in ASCII
semantics, ignoring the second half of ISO-8859-1
altogether.
This is wrong, because from the programmer's perspective,
you can now
have $foo eq $bar, while $foo =~ /\w/ and $bar !~ /\w/.
Abstraction is
broken.
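
The broken abstraction can be reproduced with a few lines, sketched here under the legacy semantics being described. The explicit "no feature" line is an addition for this example: it needs a perl that knows the unicode_strings feature (5.12 or later) and only makes sure the old behaviour is in effect:

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # force the legacy semantics discussed here

my $foo = "\xE9";       # e-acute, a word character in ISO-8859-1 and Unicode
my $bar = "\xE9";
utf8::upgrade($foo);    # same character, but UTF-8 encoded internally

my $same     = ($foo eq $bar) ? 1 : 0;
my $foo_is_w = ($foo =~ /\w/) ? 1 : 0;   # Unicode rules apply
my $bar_is_w = ($bar =~ /\w/) ? 1 : 0;   # ASCII-only rules apply

print "same=$same foo_is_w=$foo_is_w bar_is_w=$bar_is_w\n";
```

The two strings are equal as far as eq is concerned, yet only the internally upgraded one matches \w.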
> Juerd> Do not use the UTF8 flag to determine if
you're going to use
> Juerd> Unicode semantics or not, in new code. Use
something that is
> Juerd> visible in Perl code, instead of some
internal variable. For
> Juerd> example, a pragma.
> I tend to agree, however pragmas tend to be global,
> program- or package-wise, and what suits best here is
> individual, per-call flag.
Global is a problem in most cases, but I feel it would be
perfect here,
simply because the filesystem is equally global. In fact,
it's even
longer lived than your Perl program.
Better yet, global variables can be localized to dynamic
scope. This is
good, because when you set the encoding for /foo, it should
work for
encoding-unaware modules too.
Maybe a hash would be nice:
    ${^FS_ENCODING} = 'A';
    ${^FS_ENCODING} = 'B';
    ${^FS_ENCODING} = 'auto';
    open my $fh, ">", "/foo/bar/baz/quux/blah/hello.txt";
Which then actually does:
    open my $fh, ">", join("/",
        "",
        encode(detect_encoding("/"), "foo"),
        encode("A", "bar"),
        encode("B", "baz"),
        encode("B", "quux"),
        encode(detect_encoding("/foo/bar/baz/quux"), "blah"),
        encode(detect_encoding("/foo/bar/baz/quux/blah"), "hello.txt"),
    );
Like most things, this would only work if all encodings are
ASCII
compatible. (For the "/" separator)
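To make the per-directory idea concrete, here is a rough sketch in plain Perl using the Encode module. Nothing in it exists in perl itself: `%fs_encoding`, `encode_path` and the stubbed `detect_encoding` are hypothetical names, and the stub simply answers UTF-8 where a real implementation would have to consult the file system or the user's configuration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical table: directory => encoding of the names stored in it.
my %fs_encoding = (
    '/foo'     => 'iso-8859-1',
    '/foo/bar' => 'UTF-8',
);

# Stand-in for detect_encoding(); always answers UTF-8 in this sketch.
sub detect_encoding { return 'UTF-8' }

# Encode every path component with the encoding that applies to the
# directory containing it. A setting is inherited by subdirectories
# until a deeper directory overrides it.
sub encode_path {
    my ($path) = @_;
    my ($root, @parts) = split m{/}, $path, -1;   # $root is '' for absolute paths
    my $dir = '';
    my $enc = $fs_encoding{'/'};
    my @out = ($root);
    for my $part (@parts) {
        my $use = (defined $enc && $enc ne 'auto')
            ? $enc
            : detect_encoding($dir eq '' ? '/' : $dir);
        push @out, encode($use, $part);
        $dir .= "/$part";
        $enc = $fs_encoding{$dir} if exists $fs_encoding{$dir};
    }
    return join '/', @out;
}

my $in_foo = encode_path("/foo/b\x{e9}r");       # name inside /foo: ISO-8859-1
my $in_bar = encode_path("/foo/bar/b\x{e9}z");   # name inside /foo/bar: UTF-8
printf "%d bytes, %d bytes\n", length $in_foo, length $in_bar;
```

The same character U+00E9 ends up as one byte under /foo and as two bytes under /foo/bar, while the "/" separators are left alone, which is why the scheme depends on ASCII-compatible encodings.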
> Juerd> Support unicode always or never, or let the
user decide. Please do
> Juerd> not apply heuristics here.
> I must've written something unclear. There's no
heuristics, and it is
> user that decides which semantics to use.
Using the UTF8 flag for that would have been a heuristic.
> Juerd> This does not scale to functions like glob
and open that don't act
> Juerd> on a DIRHANDLE, but do access directories.
When you open
> Juerd> /foo/bar/baz/quux, each part can have its
own expected encoding, so
> Juerd> you need to be able to set different
encodings for /foo, /foo/bar,
> Juerd> /foo/bar/baz, and /foo/bar/baz/quux.
> This is true for glob, but untrue for open, -- the
latter does not return
> filenames.
It does not return them, but it does use them. It has to
encode paths
with the same encodings that readdir uses to decode them, or
symmetry is
broken and the result of readdir is now useless.
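The symmetry requirement can be illustrated with explicit Encode calls in today's perl. This is only a sketch: the single `$fs_encoding` setting is an assumption standing in for whatever mechanism would eventually choose the encoding, and the file name is an arbitrary example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

my $fs_encoding = 'UTF-8';             # assumed encoding for this directory
my $dir  = tempdir(CLEANUP => 1);
my $name = "\x{20AC}uro.txt";          # character string containing U+20AC

# Writing: encode the character string to bytes before calling open().
open my $out, '>', "$dir/" . encode($fs_encoding, $name)
    or die "create failed: $!";
close $out;

# Reading: decode what readdir() returns with the same encoding.
opendir my $dh, $dir or die "opendir failed: $!";
my @names = map  { decode($fs_encoding, $_) }
            grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

# Symmetry: re-encoding a decoded name must lead back to the same file.
my $reopened = open(my $in, '<', "$dir/" . encode($fs_encoding, $names[0]));
print $reopened ? "round trip ok\n" : "round trip failed: $!\n";
```

If open() encoded with a different encoding than the one readdir() decoded with, the last step would look for a file that does not exist.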
> >> all results of PerlIO_readdir() will be
simply flagged with SvUTF8, and
> >> the validity of utf8 string can be later
checked with utf8::valid, if
> >> necessary.
> Juerd> That's a scary and potentially dangerous
approach. SvUTF8 is
> Juerd> treated as a promise that says "this
buffer is valid UTF8". This is
> Juerd> why :encoding(UTF-8) is often a better
choice than :utf8. In fact,
> Juerd> I'm still pissed off by the poor huffman
coding here.
> This is also a bit unclear to me.
The responsibility for checking the value should be perl's,
not the
programmer's.
> that simply flags all input with SvUTF8, valid or not.
Simply flagging is arguably wrong and dangerous. Instead of
simply
flagging, the string should be decoded properly. This may
result in
exactly the same byte sequence, but provides important
checks.
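The difference between flagging and decoding can be shown on a byte string that is not valid UTF-8. The sketch below uses `Encode::_utf8_on`, which the Encode documentation itself marks as internal and dangerous, purely to imitate the "just set the flag" approach; the sample bytes are an arbitrary Latin-1 string.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

my $raw = "caf\xe9 au lait";     # Latin-1 bytes; \xe9 is not valid UTF-8 here

# Approach 1: only set the flag. Nothing is checked, so the scalar now
# claims to be UTF-8 while holding a malformed sequence.
my $flagged = $raw;
Encode::_utf8_on($flagged);
print "flagged string is ", (utf8::valid($flagged) ? "valid" : "INVALID"), "\n";

# Approach 2: decode properly. With FB_CROAK the malformed input is
# rejected at the point of decoding instead of leaking into the program.
my $copy    = $raw;              # decode() may modify its input when a check is set
my $decoded = eval { decode('UTF-8', $copy, FB_CROAK) };
print defined $decoded ? "decoded\n" : "decode rejected the input\n";
```

With valid input both approaches yield the same bytes; only the second one notices when the input is wrong.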
> Juerd> Simple but dangerous, and there's no way of
knowing that what
> Juerd> readdir returns should actually be
interpreted as UTF-8. (AFAIK.)
> True, but again, lets look at utf8 IO layer: whoever
uses that, accepts the
> burden of checking the validity of input. Same is for
readdir.
Yes, with no documentation whatsoever pointing out the
danger. I'm
looking for tuits to fix this. ":utf8" is used
MUCH too easily, because
people DO NOT KNOW that they then have to check for validity
themselves.
It's fine when writing (encoding), it's bad when reading
(decoding).
--
Met vriendelijke groet, Kind regards, Korajn salutojn,
Juerd Waalboer: Perl hacker <##### juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy
<sales convolution.nl>
|
|
| Re: proposed change in utf8 filename
semantics |

|
2007-09-19 09:01:27 |
Dmitry Karasik wrote:
> Hi John!
>
> John> What happens on NTFS if you store a filename
using UTF8 encoding
> John> through the traditional UNIX calls?
>
> The default behavior stays the same -- filenames will
be stored using
> byte semantics. If, however, a filename scalar has
SvUTF8 flag set,
> UTF8 semantics will be used.
I was referring to outside of Perl.
The VMS ODS-5 file system was developed for use with Pathworks/Advanced
Server to serve files to Microsoft Windows. Pathworks used code
licensed from Microsoft through AT&T. So ODS-5 was designed to have a
filename behavior similar to NTFS, without the support for 8.3 names.
So it is possible that whatever oddities VMS has in its Unicode
handling, Microsoft Windows also has on NTFS.
> John> Now here the problem really gets bad. There are programs on VMS that
> John> only understand how to represent Unicode filenames in the native
> John> language using VTF-7 format.
> John> Filenames encoded in the UTF-8 (binary)
format are not usable to
> John> those programs.
>
> John> Filenames encoded in VTF-7 format are not
visible to programs
> John> expecting UNIX file syntax.
>
> I have very little knowledge about VMS, but based on what you say, there
> is no way to resolve the conflict, as presented.
It requires the user or system administrator to set a flag
indicating
what mode to use. An enhancement request has been filed for
the VMS C
library to have such a flag. I do not know what the status
of that
request is and do not have any direct way to find out. In
any case, the
C library change would only help for future versions of
VMS.
> John> Which Unicode API should VMS use when the
SvUTF8 is set? Native,
> John> UTF-8 UNIX, or VTF-7?
>
> I wish I would be able to answer your question, because
VMS support will
> be vital for my proposition, however as I'm not
competent here, I simply don't
> know, sorry.
Right now, Perl does not fully support VMS ODS-5 for
non-Unicode
filenames, and I need to get that working before I can look
at adding
Unicode support. The fact that I also have not really worked with
Unicode is also a hindrance, as I do not have any independent test cases
to verify that I get things right.
The VMS port of Perl contains latent code that looks for an external
flag to determine whether it should convert UNIX UTF-8 to VTF-7 or pass
it through. Some of the VTF-7 handling is now present, but it is
untested.
> John> I can easily tell if I do a readdir() if the
filename it is reading
> John> is VTF-7 encoded or not. I do not know for
sure if it is utf-8
> John> encoded.
>
> But the same holds for default unix semantics -- results from readdir()
> can easily be analyzed for whether they are utf8 or not. The idea is not
> to analyze the output at all, but rather to let the users decide what
> encoding they want their filenames in.
Realize that the user may not know or want to care about
filename encodings.
With UNIX it is not an issue because everything on the
system treats a
filename the same way, regardless of the encoding.
With VMS, it is an issue because there is a traditional
native syntax, a
UNIX translation of that syntax, an extended native syntax,
and that
extended native syntax requires changes to the UNIX
translation.
> John> First see if your platform will accept UTF-8
encoded filenames in
> John> UNIX syntax as different files from other
Unicode encoded.
>
> Possibly. Again, this is VMS-related, so I don't know.
No, it is related to the platforms that you are using. What you need to
do is a simple test:
  Create a file name using characters that require Unicode encoding.
  Create a UTF-8 representation of that filename and create a file
  with that name in an empty directory.
  Create the wide (UCS-2) representation of the above file name.
  Use the wide open routine to try to open the existing file that
  you just created.
If that step succeeds, then your platform treats UTF-8 and UCS-2
representations as the same filename transparently, and most, if not
all, of your hacks are not needed.
If that step fails, then you have the exact same issue as
VMS, where
UTF-8 filenames and "wide" filenames are treated
as different files, and
that the same special handling is needed to know if a file
name string
with the SvUTF8 flag needs to be passed through as binary or
converted
to "wide" for use with a "wide" call.
And in the case that the step fails, then you need guidance
from
external to the program as to how to handle the UTF-8 code.
> John> I do not like build options, is there any way
to make it a run time
> John> setting like a mode or a pragma.
>
> Me neither, and I guess there should be no problem
adding such a pragma.
>
> John> It may be more practical to first build an
external overload to the
> John> filename functions to do the translations
based on an object type of
> John> a file specification, and the properties of
that object.
> John> This way perl modules can use class of a file
specification and its
> John> properties as an enhancement to the base
perl, and it would be clear
> John> that the object in question is a file
specification.
>
> This is interesting. Possibly I'm doing a premature optimization, because
> the actual changes required in system-independent files are minor, so I
> thought that direct changes to core that don't change anything on unix
> would be good enough.
>
> John> Even more fun would be if someone needed to
write a Perl script to
> John> rename UTF-8 encoded names to VTF-7 encoded
names or the reverse.
> John> It might be to maintain hard links between
the two encodings.
>
> Would it be ( excuse my VMS ignorance ) more
appropriate to treat SvUTF8
> flag as indication of which layer to use? So
SvUTF8-flagged scalars that
> contain characters > 0x7f will be put through VTF-7
layer, and unix-layer
> otherwise? But I see that just one SvUTF8 flag might
not be enough here.
SvUTF8 is a binary flag. I need a flag to indicate how I
should
translate UTF-8 encoded file names to native VMS file
names.
And I can not trust the UTF-8 flag to have been set, because
things like
File::Spec and VMS::Filespec currently do not appear to deal
with it and
may strip it off of a processed or created file
specification.
That is why the solution may be to create a class to handle
filenames
and file systems with methods and properties that are unique
to them.
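As a rough illustration of what such a class might look like, here is a minimal sketch. The package name `My::FileSpec`, its `encoding` property and its `native` method are all hypothetical; a real design for VMS would also need to carry the syntax (native, UNIX, extended) and the VTF-7 choice discussed above.

```perl
package My::FileSpec;    # hypothetical class name
use strict;
use warnings;
use Encode qw(encode);

# A file specification carries its own encoding, so the information
# cannot be lost the way a stripped SvUTF8 flag can.
sub new {
    my ($class, %args) = @_;
    my $self = {
        name     => $args{name},
        encoding => defined $args{encoding} ? $args{encoding} : 'UTF-8',
    };
    return bless $self, $class;
}

sub name     { return $_[0]{name} }
sub encoding { return $_[0]{encoding} }

# Byte string to hand to open(), stat() and friends.
sub native {
    my ($self) = @_;
    return encode($self->{encoding}, $self->{name});
}

package main;

my $spec = My::FileSpec->new(
    name     => "r\x{e9}sum\x{e9}.txt",
    encoding => 'iso-8859-1',
);
printf "%s name, %d bytes on disk\n", $spec->encoding, length $spec->native;
```

Because the translation rule is a property of the object rather than a flag on the scalar, modules that merely pass the object along cannot silently drop it.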
-John
wb8tyw qsl.net
Personal Opinion Only