Thread: Re: Regexp failure with utf8-flagged string and byte-flagged pattern




Re: Regexp failure with utf8-flagged string and byte-flagged pattern
user name
2007-09-21 16:56:56
On 9/21/07, demerphq <demerphqgmail.com> wrote:
> On 9/21/07, slavenrezic.de <slavenrezic.de> wrote:
> > > On 9/20/07, via RT srezic  cpan.org <perlbug-followupperl.org> wrote:
> > >> # New Ticket Created by  sreziccpan.org
> > >> # Please include the string:  [perl #45605]
> > >> # in the subject line of all future correspondence about this issue.
> > >> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=45605 >
> > >>
> > >>
> > >> This is a bug report for perl from sreziccpan.org,
> > >> generated with the help of perlbug 1.36 running under perl 5.10.0.
> > >>
> > >>
> > >> -----------------------------------------------------------------
> > >> The script below works as expected until perl 5.8.8 (i.e. it prints
> > >> "1").
> > >> With perl5.10.0 the pattern does not match anymore.
> > >>
> > >> Regards,
> > >>     Slaven
> > >>
> > >> #!perl
> > >> $string = 'Öschel';
> > >> utf8::upgrade($string);
> > >> warn $string =~ m{(?:Ö|&Ouml;)schel};
> > >> __END__
> > >
> > > I don't have a blead handy right now to test with, could someone please
> > > send me the output of this with a
> > >
> > > use re Debug=>'ALL';
> > >
> > > right before the warn statement.
> > >
> >
> > See the attachment.
>
> Thanks to you and Merijn I can say with pretty good certainty what the
> problem is.
>
> The trie code builds a char class during its construction phase, and
> is not storing the first byte of the unicode representation of
> codepoints between 128 and 255.
>
> The fix should be fairly straightforward but I don't have access to
> the tools to do it myself just at the second.
>
> But we need to make sure this is fixed before 5.10 is released.

Just to expand on this, somewhere in or around the make_trie code is
some logic that turns on a bit in a bit vector for every start byte in
the trie. In the branch for handling non-unicode data it needs to do
something like the following pseudo code.

/* store first byte of utf8 representation of codepoints in the
   127 < cp < 256 range */
if (127 < cp && cp < 192) {
   SETBIT(CHARCLASS,194)
} else if (191 < cp && cp < 256) {
   SETBIT(CHARCLASS,195)
}

Anyway, if somebody feels like figuring out where this code would go
(and adjusting it correctly; once you find where it is, it will be
obvious how to correct it) then it would be cool. (I'm pretty sure it
will be in a utility macro defined just before the routine.) Otherwise
this will have to wait until my desktops are unpacked and set up. (I
just moved apartment and haven't finished unpacking yet.)
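
The two constants in the pseudo code above (194 and 195) are simply the lead bytes of the two-byte UTF-8 encodings of codepoints 128..255. A minimal sketch of that arithmetic follows; the helper name utf8_lead_byte_latin1 is hypothetical and does not come from the perl source.

```c
#include <assert.h>

/* Hypothetical helper, not a function from the perl source: returns the
   first byte of the two-byte UTF-8 encoding of a codepoint in 128..255. */
static unsigned char utf8_lead_byte_latin1(unsigned int cp)
{
    assert(cp >= 128 && cp <= 255);
    /* Two-byte UTF-8 has the form 110xxxxx 10yyyyyy. The lead byte
       carries the codepoint bits above the low six, so for this range
       it can only be 0xC2 (194) or 0xC3 (195). */
    return (unsigned char)(0xC0 | (cp >> 6));
}
```

For the 'Ö' in the bug report (codepoint 0xD6) this gives 195, which is the byte the regex engine sees first once the string has been upgraded.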

Yves



-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Regexp failure with utf8-flagged string and byte-flagged pattern
user name
2007-09-22 03:00:27
Moin,

On Friday 21 September 2007 23:56:56 demerphq wrote:
> On 9/21/07, demerphq <demerphqgmail.com> wrote:
> > But we need to make sure this is fixed before 5.10 is released.
>
> Just to expand on this, somewhere in or around the make_trie code is
> some logic that turns on a bit in a bit vector for every start byte in
> the trie. In the branch for handling non unicode data it needs to do
> something like the following pseudo code.
>
> /* store first byte of utf8 representation of codepoints in the
>    127 < cp < 256 range */
> if (127 < cp && cp < 192) {
>    SETBIT(CHARCLASS,194)
> } else if (191 < cp && cp < 256) {
>    SETBIT(CHARCLASS,195)
> }

Neither SETBIT nor "vector" appear in the source. In the end, grepping
for "bitfield" leads to line 1392, which looks like:

        if ( set_bit ) /* bitmap only alloced when !(UTF&&Folding) */
            TRIE_BITMAP_SET(trie,*uc); /* store the raw first byte
                                          regardless of encoding */

        for ( ; uc < e ; uc += len ) {
            TRIE_CHARCOUNT(trie)++;
            TRIE_READ_CHAR;
            chars++;
            if ( uvc < 256 ) {
                if ( !trie->charmap[ uvc ] ) {
                    trie->charmap[ uvc ]=( ++trie->uniquecharcount );
                    if ( folder )
                        trie->charmap[ folder[ uvc ] ] = trie->charmap[ uvc ];
                    TRIE_STORE_REVCHAR;
                }
                if ( set_bit ) {
                    /* store the codepoint in the bitmap, and if its ascii
                       also store its folded equivelent. */
                    TRIE_BITMAP_SET(trie,uvc);
                    if ( folder ) TRIE_BITMAP_SET(trie,folder[ uvc ]);
                    set_bit = 0; /* We've done our bit */
                }
            } else {
                SV** svpp;
                if ( !widecharmap )
                    widecharmap = newHV();

                svpp = hv_fetch( widecharmap, (char*)&uvc, sizeof( UV ), 1 );

                if ( !svpp )
                    Perl_croak( aTHX_ "error creating/fetching widecharmap entry for 0x%"UVXf, uvc );

                if ( !SvTRUE( *svpp ) ) {
                    sv_setiv( *svpp, ++trie->uniquecharcount );
                    TRIE_STORE_REVCHAR;
                }
            }


And I believe the modification needs to be done in the first branch.
However, I am not sure what to insert where.
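
To make the effect of such a change concrete, here is a simplified standalone model of a start-byte bitmap. It is only a sketch under stated assumptions, not a patch to the code quoted above: start_bitmap, bitmap_set, bitmap_test and record_start are hypothetical names, and the also_utf8_lead flag stands in for "the proposed extra step is present".

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for the trie's start-byte bitmap: one bit per
   possible byte value. All names in this sketch are hypothetical. */
typedef struct { unsigned char bits[32]; } start_bitmap;

static void bitmap_init(start_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

static void bitmap_set(start_bitmap *bm, unsigned int byte)
{
    assert(byte < 256);
    bm->bits[byte >> 3] |= (unsigned char)(1u << (byte & 7));
}

static int bitmap_test(const start_bitmap *bm, unsigned int byte)
{
    assert(byte < 256);
    return (bm->bits[byte >> 3] >> (byte & 7)) & 1;
}

/* Record the possible start bytes for a word of a byte-encoded
   (non-UTF-8) pattern whose first codepoint is cp. The raw byte is
   always stored; when also_utf8_lead is nonzero and cp is above 127,
   the lead byte of its UTF-8 encoding (194 or 195) is stored as well,
   so that a UTF-8 encoded target string can still pass the check. */
static void record_start(start_bitmap *bm, unsigned int cp, int also_utf8_lead)
{
    assert(cp < 256);
    bitmap_set(bm, cp);
    if (also_utf8_lead && cp > 127)
        bitmap_set(bm, 0xC0 | (cp >> 6));
}
```

In this model, recording 'Ö' (0xD6) without the extra step leaves bit 0xC3 clear, so an upgraded string that starts with the bytes 0xC3 0x96 is rejected before the trie is even consulted; with the extra step, bit 0xC3 is set and the match can proceed.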

All the best,

Tels

-- 
 Signed on Sat Sep 22 09:58:36 2007 with key 0x93B84C15.
 View my photo gallery: http://bloodgate.com/photos
 PGP key on http://bloodgate.com/tels.asc or per email.

 "I am soo clumsy today." *CRASH*
