
Thread: Re: Unicode regex confusion




user name
2007-02-26 10:46:51
Moin,

demerphq wrote:
> On 2/26/07, Dave Mitchell wrote:
> [snip]
>> If we leave it to the engine to handle \x{..} expansion, then it ought to
>> upgrade any chr(128)..chr(255)s in a similar manner to the way the
>> double-quoted string scanner does. I guess. Er, maybe.
>
> Well, the original snippet will work if it's saved as UTF-8. Which I
> take to mean that the regex engine currently can't properly upgrade
> chr(128)..chr(255), as it has no way to know if it is truly a
> corrupted Unicode string or a true high-bit octet.
>
> So the question is how does the toker handle this? How does it know
> what encoding the source is, and given that it does know, how can we
> transmit this information to the regex engine? Or is it done by a
> heuristic that we can copy? IOW, "if you see a high-bit octet and it
> isn't part of a valid UTF-8 sequence, it should be upgraded"?
>
> Looking into this further, the utf8.pl script is represented on disk as
> the following octets:
>
> >od -t x1 utf8.pl
> 0000000 ef bb bf 2f c3 9f 7c 5c 78 7b 32 35 36 7d 2f
> 0000017
>
> whereas the latin1.pl script looks like:
>
> >od -t x1 latin1.pl
> 0000000 2f df 7c 5c 78 7b 32 35 36 7d 2f
> 0000013
>
> Checking into it, "EF BB BF" is the BOM for UTF-8, which says to me
> that the toker a) understands BOMs, and b) is not telling the
> regex engine whether the constant part of the pattern is UTF-8 or not.

Actually, the BOM isn't that important: when I edit a file with vim, it is
saved as UTF-8 without the BOM, and Perl works fine, provided you put a
"use utf8;" inside the source. Oh, and of course make sure that every input
you read in is decoded and every output you write out is encoded.

See the attached no_utf8.pl, which fails, and utf8.pl, which works.

Working with Unicode without "use utf8;" and explicit decode/encode for all
data is not a sane idea IMO. The only thing I wonder is what happens if you
have a constant string before you load utf8.pm - maybe in a BEGIN block or
something?

So, for regexps, could you just check if utf8 is in effect at the current
scope?

All the best,

Tels

-- 
Signed on Mon Feb 26 17:41:09 2007 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or per email.

"I want to squirt you a picture of my kids. You want to squirt me back a
video of your vacation. That's a software experience."
  -- Steve Ballmer on the Zune
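
For readers who want to poke at the octets from the two od dumps quoted in the message, here is a minimal sketch in Python. It is not how perl's toker or regex engine works, and it is not taken from the attached scripts; it only illustrates the "valid UTF-8, otherwise upgrade from Latin-1" heuristic that demerphq asks about. The function name decode_guess and both constant names are made up for this example.

```python
# Octets copied from the two od dumps quoted in the message.
UTF8_PL = bytes.fromhex("ef bb bf 2f c3 9f 7c 5c 78 7b 32 35 36 7d 2f")
LATIN1_PL = bytes.fromhex("2f df 7c 5c 78 7b 32 35 36 7d 2f")


def decode_guess(octets):
    """Hypothetical heuristic: valid UTF-8 wins, otherwise assume Latin-1.

    Returns a (text, encoding_name) pair.
    """
    try:
        # "utf-8-sig" also strips a leading EF BB BF byte order mark.
        return octets.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every octet to the code point of the same value,
        # so this branch can never fail.
        return octets.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    for name, octets in (("utf8.pl", UTF8_PL), ("latin1.pl", LATIN1_PL)):
        text, encoding = decode_guess(octets)
        # ascii() keeps the output readable on any terminal encoding.
        print(name, encoding, ascii(text))
```

Both files decode to the same eleven characters, a pattern whose first alternative is the single character U+00DF. The weakness of the heuristic is the one raised in the quoted text: Latin-1 data that happens to form a valid UTF-8 sequence (for example the two octets C3 9F meant as two separate characters) gets misread, which is why an explicit declaration such as "use utf8;" is the safer signal.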
Attachments: two files, approximately 107 bytes and 120 bytes
