Thread: Re: Unicode regex confusion
2007-02-26 10:46:51
Moin,
demerphq <...@gmail.com> wrote:
>On 2/26/07, Dave Mitchell <...@iabyn.com> wrote:
[snip]
>> If we leave it to the engine to handle \x{..} expansion, then it ought to
>> upgrade any chr(128)..chr(255)s in a similar manner to the way the
>> double-quoted string scanner does. I guess. Er, maybe.
>Well, the original snippet will work if it's saved as UTF-8. Which I
>take to mean that the regex engine currently can't properly upgrade
>chr(128)..chr(255), as it has no way to know if it is truly a
>corrupted Unicode string or a true high-bit octet.
>So the question is: how does the toker handle this? How does it know
>what encoding the source is, and given that it does know, how can we
>transmit this information to the regex engine? Or is it done by a
>heuristic that we can copy? IOW, "if you see a high-bit octet and it
>isn't part of a valid UTF-8 sequence, it should be upgraded"?
>
>Looking into this further, the utf8.pl script is represented on disk as
>the following octets:
>
>>od -t x1 utf8.pl
>0000000 ef bb bf 2f c3 9f 7c 5c 78 7b 32 35 36 7d 2f
>0000017
>
>Whereas the latin1.pl script looks like:
>
>>od -t x1 latin1.pl
>0000000 2f df 7c 5c 78 7b 32 35 36 7d 2f
>0000013
>
>Checking into it, "EF BB BF" is the UTF-8 BOM, which says to me
>that the toker a) understands BOMs, and b) is not telling the
>regex engine whether the constant part of the pattern is UTF-8 or not.
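The heuristic asked about in the quoted text can be sketched in Perl space. This is only an illustration of the idea, not what the toker or the regex engine actually does: guess_decode is a hypothetical helper that strips a UTF-8 BOM if one is present, tries a strict UTF-8 decode, and falls back to reading every octet as Latin-1 when the octets are not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of perl): guess how to read raw source octets.
# Returns (encoding guess, decoded character string, whether a BOM was seen).
sub guess_decode {
    my ($octets) = @_;

    # EF BB BF at the very start is the UTF-8 BOM; drop it if present.
    my $had_bom = ($octets =~ s/\A\xEF\xBB\xBF//) ? 1 : 0;

    # Strict decode: FB_CROAK dies on malformed input and may modify
    # its argument, so work on a copy.
    my $copy  = $octets;
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return ('utf8', $chars, $had_bom) if defined $chars;

    # Not valid UTF-8: treat each high-bit octet as a Latin-1 character.
    return ('latin1', decode('ISO-8859-1', $octets), $had_bom);
}

# The octets from the two od dumps above.
my $utf8_pl   = "\xEF\xBB\xBF/\xC3\x9F|\\x{256}/";
my $latin1_pl = "/\xDF|\\x{256}/";

for my $src ($utf8_pl, $latin1_pl) {
    my ($guess, $chars, $had_bom) = guess_decode($src);
    printf "%-6s bom=%d chars=%d\n", $guess, $had_bom, length $chars;
}
```

Both files decode to the same character string under this guess. The weak spot is the one raised in the quoted text: Latin-1 data that happens to form valid UTF-8 sequences would be misread, so a heuristic like this can only ever be a fallback.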
Actually, the BOM isn't that important: when I edit a file with vim, it
is saved as UTF-8 without the BOM, and Perl works just fine, except that
you have to put a "use utf8;" inside the source.

Oh, and of course make sure that every input you read in is decoded, and
every output you write out is encoded.

See attached no_utf8.pl, which fails, and utf8.pl, which works.

Working with Unicode without "use utf8;" and explicit decode/encode for all
data is not a sane idea IMO.
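A minimal sketch of the decode-on-input, encode-on-output discipline described above, combined with "use utf8;". The sample word and the variable names are made up for illustration, and the file is assumed to be saved as UTF-8; the attached no_utf8.pl and utf8.pl are not reproduced here.

```perl
use strict;
use warnings;
use utf8;                       # literals in this file are UTF-8 encoded characters
use Encode qw(decode encode);

# Octets as they might arrive from a file or socket: "Straße" in UTF-8.
my $raw_in = "Stra\xC3\x9Fe";

# Decode on input: 7 octets become 6 characters.
my $text = decode('UTF-8', $raw_in);

# With "use utf8;" the literal ß in the pattern is one character, so it
# matches the decoded text but not the raw octets.
my $match_decoded = ($text   =~ /ß/) ? 1 : 0;
my $match_raw     = ($raw_in =~ /ß/) ? 1 : 0;

# Encode on output: back to the original octets.
my $raw_out = encode('UTF-8', $text);
```

For real file handles the same effect can be had with an I/O layer such as open(my $fh, '<:encoding(UTF-8)', $path), which does the decoding as the data is read.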
The only thing I wonder is what happens if you have a constant string before
you load utf8.pm - maybe in a BEGIN block or something?

So, for regexps, could you just check if utf8 is in effect at the current
scope?
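From Perl space, one way to see whether "use utf8;" was in effect where a piece of code was compiled is to look at the compile-time hints reported by caller(). The sketch below assumes that utf8.pm sets the 0x00800000 bit in $^H (HINT_UTF8 in the perl sources). caller_has_utf8 is a hypothetical helper, and perldoc -f caller warns that the hints value may change between versions and is not meant for external use. Inside the core the equivalent check would presumably be made in C against the same hint bit.

```perl
use strict;
use warnings;

# Hypothetical helper: was the code that called us compiled under "use utf8"?
sub caller_has_utf8 {
    my $hints = (caller(0))[8];             # compile-time $^H at the call site
    return ($hints & 0x00800000) ? 1 : 0;   # assumed value of HINT_UTF8
}

my $without = caller_has_utf8();            # no "use utf8" in this scope

my $with;
{
    use utf8;                               # lexically scoped to this block
    $with = caller_has_utf8();
}

print "outside block: $without, inside block: $with\n";
```

Because the pragma is lexical, a constant string compiled before utf8.pm is loaded is compiled without the hint.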
All the best,
Tels
--
Signed on Mon Feb 26 17:41:09 2007 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or per email.
"I want to squirt you a picture of my kids. You want to squirt me back a
video of your vacation. That's a software experience."
-- Steve Ballmer on the Zune