Thread: Re: Unicode regex confusion




2007-02-26 08:40:33
On Mon, 26 Feb 2007 09:14:46 +0100, demerphq wrote:

> So the question is how does the toker handle this? How does it know
> what encoding the source is, and given that it does know how can we
> transmit this information to the regex engine? Or is it done by a
> heuristic that we can copy? IOW, "if you see a high bit octet and it
> isn't part of a valid utf8 sequence it should be upgraded?"

The toker upgrades high-bit octets if needed.
When scan_const() parses "ßx" in Latin-1 (or any encoding
where "ß" is represented by one octet, not two), it first stores "ß"
as an octet, then upgrades it to utf8 on finding "x".
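
The upgrade step described above can be sketched in standalone C. This is
only an illustration, not the toke.c code: count_hibit() and upgrade_latin1()
are hypothetical names, the buffer is a plain malloc'ed array rather than an
SV, and it assumes an ASCII platform where every octet >= 0x80 is a Latin-1
character that needs two bytes in utf8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Count octets with the high bit set; each one grows by one byte
 * when the buffer is upgraded to utf8. */
static size_t count_hibit(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] & 0x80)
            n++;
    return n;
}

/* Upgrade a Latin-1 buffer to utf8 (hypothetical helper).
 * Returns a malloc'ed, NUL-terminated buffer that the caller frees,
 * or NULL if allocation fails. */
static unsigned char *upgrade_latin1(const unsigned char *s, size_t len,
                                     size_t *outlen)
{
    assert(s != NULL && outlen != NULL);

    size_t hi = count_hibit(s, len);
    unsigned char *out = malloc(len + hi + 1);
    if (!out)
        return NULL;

    unsigned char *d = out;
    for (size_t i = 0; i < len; i++) {
        if (s[i] & 0x80) {
            *d++ = (unsigned char)(0xC0 | (s[i] >> 6));    /* lead byte */
            *d++ = (unsigned char)(0x80 | (s[i] & 0x3F));  /* continuation */
        } else {
            *d++ = s[i];                                   /* invariant octet */
        }
    }
    *d = '\0';
    *outlen = (size_t)(d - out);
    return out;
}
```

With this sketch, the single octet 0xDF ("ß" in Latin-1) becomes the two
octets 0xC3 0x9F, while ASCII octets are copied unchanged.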

Compare scan_const() in toke.c
2138:    case 'x':

2164:      NUM_ESCAPE_INSERT:

2173:	if (!UNI_IS_INVARIANT(NATIVE_TO_UNI(uv))) {
2174:	    if (!has_utf8 && uv > 255) {
2175:	        /* Might need to recode whatever we have
2176:		 * accumulated so far if it contains any
2177:		 * hibit chars.
2178:		 *
2179:		 * (Can't we keep track of that and avoid
2180:		 *  this rescan? --jhi)
2181:		 */
2182:		int hicount = 0;
2183:		U8 *c;
2184:		for (c = (U8 *) SvPVX(sv); c < (U8 *)d; c++) {
2185:		    if (!NATIVE_IS_INVARIANT(*c)) {
2186:		        hicount++;
2187:		    }
2188:		}
2189:		if (hicount) {
2190:		    const STRLEN offset = d - SvPVX_const(sv);
          (the code continues to line 2208)
2208:       }

and reg_atom() in regcomp.c
6827:	    case 'x':
6828:		if (*++p == '{') {
6829:		    char* const e = strchr(p, '}');
6830:	
6831:		    if (!e) {
6832:			RExC_parse = p + 1;
6833:			vFAIL("Missing right brace on \\x{}");
6834:		    }
6835:		    else {
6836:                   I32 flags = PERL_SCAN_ALLOW_UNDERSCORES
6837:                             | PERL_SCAN_DISALLOW_PREFIX;
6838:                   STRLEN numlen = e - p - 1;
6839:			ender = grok_hex(p + 1, &numlen, &flags, NULL);
6840:			if (ender > 0xff)
6841:			    RExC_utf8 = 1;
6842:			p = e + 1;

Perhaps reg_atom() should do more work when ender > 0xff.
Just turning a flag on seems too lazy.
scan_const(), by contrast, stores its result in a single SV,
so it can upgrade the already-parsed part all at once.

Could regcomp upgrade the high-bit characters in regnodes it has
already created? Or should it rescan the pattern from the beginning
and create the regnodes again, this time matching utf8...?
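
The second option (rescan from the beginning) might look roughly like the
following toy sketch. Nothing here is regcomp.c code: toy_prog, toy_pass()
and toy_compile() are hypothetical names, the "program" is just a byte count
plus an encoding flag, only \x{...} is recognised as an escape, and strtoul()
stands in for grok_hex().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a compiled pattern: the encoding flag and
 * the number of literal bytes that would be emitted. */
typedef struct {
    int    utf8;
    size_t nbytes;
} toy_prog;

/* Number of utf8 bytes needed for code point cp (up to 0x10FFFF). */
static size_t utf8_len(unsigned long cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* One pass over the pattern. Returns 1 if a code point above 0xff turns
 * up while utf8 is off (the caller must restart), 0 otherwise. */
static int toy_pass(const char *pat, int utf8, toy_prog *prog)
{
    prog->utf8 = utf8;
    prog->nbytes = 0;
    for (const char *p = pat; *p; p++) {
        unsigned long cp = (unsigned char)*p;
        if (p[0] == '\\' && p[1] == 'x' && p[2] == '{') {
            const char *e = strchr(p + 3, '}');
            if (e) {                       /* no brace: treat as literals */
                cp = strtoul(p + 3, NULL, 16);
                p = e;                     /* loop increment skips '}' */
            }
        }
        if (cp > 0xff && !utf8)
            return 1;                      /* earlier output is stale */
        prog->nbytes += utf8 ? utf8_len(cp) : 1;
    }
    return 0;
}

/* Compile as octets first; if that fails, start over in utf8 mode so
 * that earlier high-bit octets are emitted as two bytes as well. */
static void toy_compile(const char *pat, toy_prog *prog)
{
    assert(pat != NULL && prog != NULL);
    if (toy_pass(pat, 0, prog))
        toy_pass(pat, 1, prog);
}
```

The restart costs a second pass, but it avoids having to patch nodes that
were already emitted with one byte per high-bit character.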

Regards,
SADAHIRO Tomoyuki


