Thread: Re: Regexp failure with utf8-flagged string and byte-flagged pattern




2007-09-22 04:50:37
On 9/22/07, Tels <nospam-abusebloodgate.com> wrote:
> Moin,
>
> On Friday 21 September 2007 23:56:56 demerphq wrote:
> > On 9/21/07, demerphq <demerphqgmail.com> wrote:
> > > But we need to make sure this is fixed before 5.10 is released.
> >
> > Just to expand on this, somewhere in or around the make_trie code is
> > some logic that turns on a bit in a bit vector for every start byte in
> > the trie. In the branch for handling non unicode data it needs to do
> > something like the following pseudo code.
> >
> > /* store first byte of utf8 representation of codepoints in the
> >    127 < cp < 256 range */
> > if (127 < cp && cp < 192) {
> >    SETBIT(CHARCLASS,194)
> > } else if (191 < cp && cp < 256) {
> >    SETBIT(CHARCLASS,195)
> > }
>
> Neither SETBIT nor "vector" appears in the source. In the end, grepping
> for "bitfield" leads to line 1392, which looks like:
>
>         if ( set_bit ) /* bitmap only alloced when !(UTF&&Folding) */
>             TRIE_BITMAP_SET(trie,*uc); /* store the raw first byte
>                                           regardless of encoding */
>
>         for ( ; uc < e ; uc += len ) {
>             TRIE_CHARCOUNT(trie)++;
>             TRIE_READ_CHAR;
>             chars++;
>             if ( uvc < 256 ) {
>                 if ( !trie->charmap[ uvc ] ) {
>                     trie->charmap[ uvc ]=( ++trie->uniquecharcount );
>                     if ( folder )
>                         trie->charmap[ folder[ uvc ] ] = trie->charmap[ uvc ];
>                     TRIE_STORE_REVCHAR;
>                 }
>                 if ( set_bit ) {
>                     /* store the codepoint in the bitmap, and if its ascii
>                        also store its folded equivelent. */
>                     TRIE_BITMAP_SET(trie,uvc);
>                     if ( folder ) TRIE_BITMAP_SET(trie,folder[ uvc ]);

Right there. The line that says

                       if ( folder ) TRIE_BITMAP_SET(trie,folder[ uvc ]);

should probably read

                       if ( folder ) { /* folder only true when pattern is not utf8 */
                           TRIE_BITMAP_SET(trie,folder[ uvc ]); /* store the folded codepoint */
                           /* store first byte of utf8 representation of
                              codepoints in the 127 < uvc < 256 range */
                           if (127 < uvc && uvc < 192) {
                               TRIE_BITMAP_SET(trie,194);
                           } else if (191 < uvc ) { /* && uvc < 256 -- we know uvc is < 256 already */
                               TRIE_BITMAP_SET(trie,195);
                           }
                       }

>                     set_bit = 0; /* We've done our bit */
>                 }
>             } else {
>                 SV** svpp;
>                 if ( !widecharmap )
>                     widecharmap = newHV();
>
>                 svpp = hv_fetch( widecharmap, (char*)&uvc, sizeof( UV ), 1 );
>
>                 if ( !svpp )
>                     Perl_croak( aTHX_ "error creating/fetching widecharmap entry for 0x%"UVXf, uvc );
>
>                 if ( !SvTRUE( *svpp ) ) {
>                     sv_setiv( *svpp, ++trie->uniquecharcount );
>                     TRIE_STORE_REVCHAR;
>                 }
>             }
>
>
> and I believe in the first branch the modification needs to be done.
> However, I am not sure what to insert where.

Thanks a lot for digging that out, it's exactly what I needed to see.

Can you try the code as I've indicated above and let me know if it
solves the problem?
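
For anyone following along, here is a minimal standalone sketch of why the constants are 194 and 195. It is not the regcomp.c code: start_bitmap, bitmap_set, bitmap_test, utf8_lead_byte and record_start_char are hypothetical names made up for illustration, standing in for the trie's start-byte bitmap and TRIE_BITMAP_SET. The idea is that a codepoint in the 128..255 range is a single byte in a byte-flagged pattern, but in a utf8-flagged string it starts with 0xC2 (194) or 0xC3 (195), so that lead byte has to be in the bitmap as well.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the trie's start-byte bitmap:
   one bit for each of the 256 possible byte values. */
typedef struct { unsigned char bits[32]; } start_bitmap;

static void bitmap_clear(start_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

static void bitmap_set(start_bitmap *bm, unsigned byte)
{
    bm->bits[(byte & 0xFF) >> 3] |= (unsigned char)(1u << (byte & 7));
}

static int bitmap_test(const start_bitmap *bm, unsigned byte)
{
    return (bm->bits[(byte & 0xFF) >> 3] >> (byte & 7)) & 1;
}

/* First byte of the two-byte UTF-8 encoding of a codepoint in 128..255.
   Two-byte sequences look like 110xxxxx 10xxxxxx, so the lead byte is
   0xC0 plus the top bits of the codepoint: 0xC2 for 128..191 and
   0xC3 for 192..255. */
static unsigned utf8_lead_byte(unsigned cp)
{
    assert(cp >= 128 && cp < 256);
    return 0xC0 | (cp >> 6);
}

/* Record a start character taken from a non-utf8 pattern: always the raw
   byte, plus the UTF-8 lead byte when the codepoint is above 127, so that
   a utf8-flagged target string is not rejected by the bitmap check. */
static void record_start_char(start_bitmap *bm, unsigned cp)
{
    bitmap_set(bm, cp);
    if (cp > 127 && cp < 256)
        bitmap_set(bm, utf8_lead_byte(cp));
}
```

With this sketch, recording codepoint 0xE9 sets both bit 0xE9 and bit 195, which is the same effect the two extra TRIE_BITMAP_SET calls above are meant to have.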

Cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
