Thread: parsing in eval() varies with UTF8ness




parsing in eval() varies with UTF8ness
2007-09-22 16:55:20
# New Ticket Created by  Zefram 
# Please include the string:  [perl #45673]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=45673 >



This is a bug report for perl from zefram@fysh.org,
generated with the help of perlbug 1.35 running under perl v5.8.8.


-----------------------------------------------------------------
[Please enter your report here]

$ perl -we '$a="require x\xf1y::z"; eval $a; print $@'
Warning: Use of "require" without parentheses is ambiguous at (eval 1) line 1.
Unrecognized character \xF1 at (eval 1) line 1.
$ perl -we '$a="require x\xf1y::z"; utf8::upgrade($a); eval $a; print $@'
Can't locate xZZy/z.pm in @INC (@INC contains: /etc/perl
/usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8
/usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8
/usr/share/perl/5.8 /usr/local/lib/site_perl
/usr/local/lib/perl/5.8.4 /usr/local/share/perl/5.8.4 .) at (eval 1) line 3.
$

What I show above as "ZZ" was originally a sequence of two non-ASCII
characters: U+00c3 (Latin capital letter A with tilde) and U+00b1
(plus-minus sign).  I've replaced them with ASCII characters to avoid
unpredictable manglement.

The phenomenon we see here is that the syntax of Perl, as judged by
eval(), varies according to whether the input string is physically
encoded in UTF8.  If it is so encoded then U+00f1, Latin small letter N
with tilde, is an acceptable identifier character, and so can be part
of a module name.  If not, then the very same character is invalid in
that context and causes a syntax error.
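
The effect described above can be checked with a short script. This is a minimal sketch and not part of the original report: the variable names are illustrative, and what each eval reports depends on the perl version and on whether the unicode_eval feature is in effect, so no particular error text is assumed.

```perl
use strict;
use warnings;

# Two strings holding the same characters; only the internal encoding differs.
my $plain    = "require x\xf1y::z";   # U+00F1 stored as the single byte F1
my $upgraded = $plain;
utf8::upgrade($upgraded);             # U+00F1 now stored as the bytes C3 B1

my $same_text     = $plain eq $upgraded      ? 1 : 0;
my $plain_flag    = utf8::is_utf8($plain)    ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;
print "same text: $same_text, UTF8 flag: $plain_flag vs $upgraded_flag\n";

# Evaluate both and keep the first line of whatever error comes back.
my @errors;
for my $src ($plain, $upgraded) {
    local $@;
    { no warnings; eval $src; }
    my ($first_line) = split /\n/, $@;
    push @errors, defined $first_line ? $first_line : '(no error)';
}
print "$_\n" for @errors;
```

If the two printed error lines differ, the parser is treating the two encodings of the same text differently, which is the behaviour reported here.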

What, exactly, is Perl's identifier syntax?  Is U+00f1 a valid identifier
character?

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=low
---
Site configuration information for perl v5.8.8:

Configured by Debian Project at Wed Dec  6 23:17:41 UTC 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.18.3, archname=i486-linux-gnu-thread-multi
    uname='linux saens 2.6.18.3 #1 smp sat nov 25 13:39:52 est 2006 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.2 20061115 (prerelease) (Debian 4.1.1-20)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.3.6.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
    gnulibc_version='2.3.6'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    

---
@INC for perl v5.8.8:
    /etc/perl
    /usr/local/lib/perl/5.8.8
    /usr/local/share/perl/5.8.8
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    /usr/local/lib/perl/5.8.4
    /usr/local/share/perl/5.8.4
    .

---
Environment for perl v5.8.8:
    HOME=/home/zefram
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/zefram/pub/i686-pc-linux-gnu/bin:/home/zefram/pub/common/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh


Re: parsing in eval() varies with UTF8ness
2007-09-23 11:16:27
Moin,

On Saturday 22 September 2007 23:55:20 Zefram wrote:
> # New Ticket Created by  Zefram
[snip]
>
> $ perl -we '$a="require x\xf1y::z"; eval $a; print $@'
> Warning: Use of "require" without parentheses is ambiguous at (eval 1) line 1.
> Unrecognized character \xF1 at (eval 1) line 1.
> $ perl -we '$a="require x\xf1y::z"; utf8::upgrade($a); eval $a; print $@'
> Can't locate xZZy/z.pm in @INC (@INC contains: /etc/perl
> /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5
> /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8
> /usr/local/lib/site_perl /usr/local/lib/perl/5.8.4
> /usr/local/share/perl/5.8.4 .) at (eval 1) line 3.
> $
>
> What I show above as "ZZ" was originally a sequence of two non-ASCII
> characters: U+00c3 (Latin capital letter A with tilde) and U+00b1
> (plus-minus sign).  I've replaced them with ASCII characters to avoid
> unpredictable manglement.

The sequence C3B1 is UTF-8 for "character 0xF1", so that is right.
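
A quick way to confirm the byte sequence is to encode U+00F1 and dump the result. This is a small illustrative sketch using the core Encode module; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the single character U+00F1 as UTF-8 and show the bytes in hex.
my $bytes = encode('UTF-8', "\x{f1}");
my $hex   = uc(unpack('H*', $bytes));

# Read the same two bytes back as if each were a Latin-1 character.
my @as_latin1 = map { sprintf 'U+%04X', ord $_ } split //, $bytes;

print "$hex\n";          # C3B1
print "@as_latin1\n";    # U+00C3 U+00B1
```

The second line of output matches the two characters that the report shows as "ZZ" in the file name.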

> The phenomenon we see here is that the syntax of Perl, as judged by
> eval(), varies according to whether the input string is physically
> encoded in UTF8.  If it is so encoded then U+00f1, Latin small letter N
> with tilde, is an acceptable identifier character, and so can be part
> of a module name.  If not, then the very same character is invalid in
> that context and causes a syntax error.
>
> What, exactly, is Perl's identifier syntax?  Is U+00f1 a valid identifier
> character?

When you don't do "use utf8;" your script is expected to be in Latin-1
(ISO-8859-1). (We leave "use locale" out of this for now.) Under use utf8,
it can contain any UTF-8.

However, it seems eval() (or require?) doesn't know about this. Plus, I am
not entirely sure how much Unicode you can use in identifiers, as something
like this:

	#!perl
	use utf8;
	my $ = 1;

still fails to compile with:

	Unrecognized character \x82 at t.pl line 5.

perldoc perlsyn (in 5.8.8) doesn't seem to say anything about identifiers.

perldoc utf8 says:

       Enabling the "utf8" pragma has the following effect:

	Bytes in the source text that have their highbit set will be
	treated as being part of a literal UTF8 character.  This
	includes most literals such as identifier names, string
	constants, and constant regular expression patterns.

But it doesn't seem to work in v5.8.8 at least.

All the best,

Tels


-- 
 Signed on Sun Sep 23 18:05:15 2007 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Spammed if you do, spammed if you don't."

  -- Murphy's Law