[written before Matz's additional answer]
At 18:35 08/05/28, Yukihiro Matsumoto wrote:
>Hi,
>
>In message "Re: Oniguruma and p"
> on Wed, 28 May 2008 18:10:24 +0900, ts <decoux moulon.inra.fr> writes:
>
>|Martin Duerst wrote:
>|> This used to work, but the last time I checked
>|> was several months ago.
>|
>| I know nothing about this thing (encoding)
>| but is this normal ?
>
>It worked as we designed. But this case, we lost script encoding.
>Not good.
I'm not sure this is a script encoding problem.
The string is in UTF-8. The regexp is, on the surface, ASCII-only,
but in meaning it contains something more than ASCII.
Wouldn't it work if an ASCII regexp, when applied to some
'more-than-just-ASCII' string, were automatically upgraded to the
encoding of that string?
Or even more, aren't things such as \p{Greek} independent of the
encoding of the regexp? Even in a script written in EUC-JP, I can
imagine having a UTF-8 string, and doing /\p{Greek}/ should try to
find Greek *as encoded in the string*, not as encoded in the regexp.
This may be easy to do or not depending on the actual implementation,
but in terms of object-oriented thinking, it definitely should work:
The regexp asks the string, at the right point: do you have a Greek
character here? It's up to the string to know (or not) what a Greek
character is.
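The "regexp asks the string" idea can be sketched roughly as follows. This is only an illustration of the proposal, not how Oniguruma or Ruby implement property matching: the helper name greek_at? is hypothetical, and transcoding the character to UTF-8 stands in for "the string knows what a Greek character is". The /u flag pins the test regexp to UTF-8, so the result does not depend on the encoding of the script.

```ruby
# Hypothetical helper: ask the string whether the character at +index+
# is Greek, whatever encoding the string happens to be in.
def greek_at?(str, index)
  ch = str[index]            # character-based indexing honours str.encoding
  return false if ch.nil?    # index past the end of the string
  # Assumption: the character can be transcoded to UTF-8; encode raises
  # for characters that have no Unicode mapping.
  ch.encode("UTF-8").match?(/\p{Greek}/u)
end

utf8  = "x\u03B1"              # "x" followed by GREEK SMALL LETTER ALPHA
eucjp = utf8.encode("EUC-JP")  # same text, different bytes

puts greek_at?(utf8, 1)    # Greek letter in a UTF-8 string
puts greek_at?(eucjp, 1)   # same letter in an EUC-JP string
puts greek_at?(eucjp, 0)   # "x" is not Greek
```

The caller never mentions the string's encoding: the answer for utf8 and eucjp is the same even though the underlying bytes differ.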
This is different from matching of literally encoded characters,
because there the implementation matches the bytes (while, for some
encodings, it also has to take care of character boundaries).
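To illustrate the character-boundary point, here is a small sketch using Shift_JIS, chosen as an example because the second byte of some of its two-byte characters falls in the ASCII range. The variable names are made up for the example. A plain byte scan finds an ASCII backslash inside the character, while Ruby's character-aware search and regexp matching are expected not to.

```ruby
# "\u30BD" is KATAKANA LETTER SO. In Shift_JIS it is the two bytes
# 0x83 0x5C, and 0x5C on its own is the ASCII backslash.
so = "\u30BD".encode("Shift_JIS")

byte_hit   = so.bytes.include?(0x5C)  # naive byte scan: finds the trail byte
char_hit   = so.include?("\\")        # substring search on character boundaries
regexp_hit = so =~ /\\/               # regexp match on character boundaries

puts byte_hit            # true: the byte is there
puts char_hit            # false: no backslash character
puts regexp_hit.inspect  # nil: no match
```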
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst it.aoyama.ac.jp