Thread: Re: decode_utf8 sets utf8 flag on plain ascii strings

2007-02-26 16:52:17
On 2/17/07, demerphq <demerphqgmail.com> wrote:
> On 2/17/07, via RT John Berthels <perlbug-followupperl.org> wrote:
> > # New Ticket Created by  "John Berthels"
> > # Please include the string:  [perl #41527]
> > # in the subject line of all future correspondence about this issue.
> > # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=41527 >
> >
> >
> > This is a bug report for perl from jjberthelsgmail.com,
> > generated with the help of perlbug 1.35 running under perl v5.8.8.
> >
> >
> > -----------------------------------------------------------------
> > [Please enter your report here]
> >
> > Hi.
> >
> > The documentation for the 'decode' function in Encode.pm states:
> >
> >          ...the utf8 flag for $string is on unless $octets entirely
> >          consists of ASCII data...
> >
> > but it appears that decode turns on the flag even if the input string is
> > plain ASCII. A test case demonstrating this is appended below.
> >
> > I understand this doesn't make a difference from a correctness point of
> > view, but it does change the performance characteristics, presumably due
> > to the use of the unicode regex engine (the profile showed something like
> > SWASHNEW taking a lot of time).
> >
> > An older version of Encode (I believe v 2.01) had the behaviour described
> > in the docs, and would have passed the test case below.
> >
> > In my case, the application is required to process utf8 data correctly,
> > but the vast majority of data is plain ascii. This change in behaviour from
> > 2.01 is causing a noticeable increase in CPU usage.
> >
> > I'm currently working around this with a regexp test /[\x80-\xff]/ on the
> > byte string and skipping the call to Encode::decode when it finds no high
> > bytes, but a quick check on perlmonks led to a suggestion that I raise
> > this as a perlbug: http://perlmonks.org/?node_id=600050 (although opinion
> > was divided on whether this was a bug).
> >
> > I've taken a quick look at the XS and can see an unconditional SvUTF8_on(dst)
> > on line 453. I don't know whether a good fix would be to add an additional
> > loop over the string to check the flag there or keep the 'only loop over
> > the string once' behaviour by passing a "was the string plain ascii" flag
> > back from process_utf8().
>
> I looked into more or less this strategy, but well, I'm not sure if it works out.
>
> I have to say the code in Encode.* is kinda confusing to this ascii
> type programmer.
>
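For reference, the second option John mentions above (one pass over the string that also reports whether it was plain ASCII) might look roughly like the sketch below. This is standalone C, not code from Encode.xs: the function name copy_and_check_ascii and its signature are made up for illustration, and the real process_utf8() does considerably more than copy bytes.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of Encode.xs: copy len bytes from src to
 * dst in a single pass and report whether every byte was 7-bit ASCII.
 * A caller could use the result to decide whether to set the UTF8 flag. */
int copy_and_check_ascii(unsigned char *dst, const unsigned char *src,
                         size_t len)
{
    int all_ascii = 1;
    size_t i;

    for (i = 0; i < len; i++) {
        if (src[i] & 0x80)
            all_ascii = 0;      /* high bit set: not plain ASCII */
        dst[i] = src[i];
    }
    return all_ascii;
}
```

The point of folding the check into the copy loop is to keep the "only loop over the string once" property; the alternative (a separate scan before SvUTF8_on) is simpler but touches the data twice.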
> > I'll happily try to whip up a patch of either solution if you agree this
> > needs changing and let me know which approach you prefer.
> >
> > regards,
> >
> > jb
> >
> >
> > #!/usr/bin/perl
> > use warnings;
> > use strict;
> > use Test::More (tests => 2);
> >
> > use Encode;
> >
> > my $ascii_bytes = "l\xf8\xf8k - a latin1 string";
> > my $latin1_bytes = "this is plain ascii";
>
> Er, aren't these backwards? \xf8 isn't in ascii, it's in latin1. ascii is
> a 7 bit encoding.
>
> > my $encoded_str = Encode::decode_utf8($latin1_bytes);
> > ok(Encode::is_utf8($encoded_str),
> >     "(check encode is working) non-ascii latin-1 byte string becomes char str");
> >
> > $encoded_str = Encode::decode_utf8($ascii_bytes);
> > ok(! Encode::is_utf8($encoded_str),
> >     "but ascii byte string untagged after decode");
>
> I changed the code to the attached perl script, encode.pl, and I get
> the attached output with perl 5.8.6, Encode version 2.09:
>
> D:devperlverzorowin32>perl encode.pl
> 1..2
> SV = PV(0x15d5914) at 0x1a6c864
>   REFCNT = 1
>   FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
>   PV = 0x15dd674 "l\303\270\303\270k - a latin1 string" [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
>   CUR = 24
>   LEN = 27
> not ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
> #   Failed test '(check encode is working) non-ascii latin-1 byte string becomes char str'
> #   in encode.pl at line 13.
> SV = NULL(0x0) at 0x1a6c720
>   REFCNT = 1
>   FLAGS = (PADBUSY,PADMY)
> ----------
> SV = PV(0x1bdb7c4) at 0x1bde2f4
>   REFCNT = 1
>   FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
>   PV = 0x1bf38f4 "this is plain ascii" [UTF8 "this is plain ascii"]
>   CUR = 19
>   LEN = 22
> ok 2 - but ascii byte string untagged after decode
> SV = PVMG(0x1bda7b4) at 0x1bd7d74
>   REFCNT = 1
>   FLAGS = (PADBUSY,PADMY,POK,pPOK)
>   IV = 0
>   NV = 0
>   PV = 0x1bf095c "this is plain ascii"
>   CUR = 19
>   LEN = 20
> # Looks like you failed 1 test of 2.
>
> Note the null return for the unicode string with high byte chars in it.
>
> Now here it is with blead plus the attached patch; notice it
> has correct output for both strings and passes the tests:
>
> D:devperlverzorowin32>..perl encode.pl
> 1..2
> SV = PV(0x1a46cc4) at 0x1a6802c
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string" [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
>   CUR = 24
>   LEN = 28
> ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
> SV = PV(0x1b400bc) at 0x1b3bf94
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x1b9cdfc "l\303\270\303\270k - a latin1 string" [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
>   CUR = 24
>   LEN = 28
> ----------
> SV = PV(0x1bb6f1c) at 0x1b68ccc
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x1b6012c "this is plain ascii" [UTF8 "this is plain ascii"]
>   CUR = 19
>   LEN = 24
> ok 2 - but ascii byte string untagged after decode
> SV = PV(0x1bb6f1c) at 0x1b68bdc
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK)
>   PV = 0x1b2985c "this is plain ascii"
>   CUR = 19
>   LEN = 20
>
>
> Now here it is with an unpatched blead:
>
> Everything is up to date. 'nmake test' to run test suite.
> 1..2
> SV = PV(0x1a46cc4) at 0x1a6802c
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string" [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
>   CUR = 24
>   LEN = 28
> ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
> SV = PVMG(0x1b60b5c) at 0x1b3bf94
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   IV = 0
>   NV = 0
>   PV = 0x1b4aa14 "l\303\270\303\270k - a latin1 string" [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
>   CUR = 24
>   LEN = 28
>   MAGIC = 0x1b4b544
>     MG_VIRTUAL = &PL_vtbl_utf8
>     MG_TYPE = PERL_MAGIC_utf8(w)
>     MG_LEN = 22
> ----------
> SV = PV(0x1bb7084) at 0x1b68ccc
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x1bbf8bc "this is plain ascii" [UTF8 "this is plain ascii"]
>   CUR = 19
>   LEN = 24
> not ok 2 - but ascii byte string untagged after decode
> #   Failed test 'but ascii byte string untagged after decode'
> #   at encode.pl line 21.
> SV = PVMG(0x1b60b9c) at 0x1b68bdc
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   IV = 0
>   NV = 0
>   PV = 0x1a8d404 "this is plain ascii" [UTF8 "this is plain ascii"]
>   CUR = 19
>   LEN = 20
>   MAGIC = 0x1bb9d94
>     MG_VIRTUAL = &PL_vtbl_utf8
>     MG_TYPE = PERL_MAGIC_utf8(w)
>     MG_LEN = 19
> # Looks like you failed 1 test of 2.
>
> Guess why we get this output? Because current decode_utf8 no-ops when
> the input string is already utf8 (contrary to the docs). Remove that
> noop line (line 196 in Encode.pm) and here is what happens:
>
> Everything is up to date. 'nmake test' to run test suite.
> 1..2
> SV = PV(0x1a46cc4) at 0x1a6802c
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x1a4fd5c "l\303\270\303\270k - a latin1 string" [UTF8 "l\x{f8}\x{f8}k - a latin1 string"]
>   CUR = 24
>   LEN = 28
> ok 1 - (check encode is working) non-ascii latin-1 byte string becomes char str
> SV = PV(0x1b400bc) at 0x1b3bf94
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x1b9cdfc "l\357\277\275\357\277\275k - a latin1 string" [UTF8 "l\x{fffd}\x{fffd}k - a latin1 string"]
>   CUR = 26
>   LEN = 28
> ----------
> SV = PV(0x1bb6f1c) at 0x1b68ccc
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x1b6012c "this is plain ascii" [UTF8 "this is plain ascii"]
>   CUR = 19
>   LEN = 24
> not ok 2 - but ascii byte string untagged after decode
> #   Failed test 'but ascii byte string untagged after decode'
> #   at encode.pl line 21.
> SV = PV(0x1bb6f1c) at 0x1b68bdc
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x1b2985c "this is plain ascii" [UTF8 "this is plain ascii"]
>   CUR = 19
>   LEN = 20
> # Looks like you failed 1 test of 2.
>
> Notice the \x{fffd}\x{fffd}, which are there because of this code
> (line 431 in Encode.xs):
>
>     if (SvUTF8(src)) {
>         s = utf8_to_bytes(s,&slen);
>         if (s) {
>             SvCUR_set(src,slen);
>             SvUTF8_off(src);
>             e = s+slen;
>         }
>         else {
>             croak("Cannot decode string with wide characters");
>         }
>     }
>
> This doesn't seem logical, and when tracing the code it doesn't work. The
> valid utf8 sequence gets converted to its byte form and then passed to
> utf8n_to_uvuni(), which naturally fails to decode it.
>
> I hope this analysis is useful to someone. It seems to me that the
> current behaviour is wrong, but I don't understand it well enough to
> say for sure.
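
To make the failure in the quoted analysis concrete: the test string holds U+00F8 twice, stored as the UTF-8 bytes C3 B8 C3 B8. After utf8_to_bytes() each pair becomes the single byte F8, and F8 F8 is not a well-formed UTF-8 sequence, so decoding it a second time cannot succeed. The sketch below is a simplified standalone C check written for illustration only. The name looks_like_utf8 is hypothetical, and it is not how utf8n_to_uvuni() works internally (Perl's own decoder accepts more lead bytes than this does); it only checks lead/continuation byte shape and does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Hypothetical, simplified structural check (NOT Perl's utf8n_to_uvuni):
 * returns 1 if buf[0..len) has well-formed lead/continuation byte shape,
 * 0 otherwise. Overlong encodings and surrogates are not rejected. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i++];
        size_t need;

        if (c < 0x80)                need = 0;  /* 0xxxxxxx: ASCII     */
        else if ((c & 0xE0) == 0xC0) need = 1;  /* 110xxxxx: 2-byte    */
        else if ((c & 0xF0) == 0xE0) need = 2;  /* 1110xxxx: 3-byte    */
        else if ((c & 0xF8) == 0xF0) need = 3;  /* 11110xxx: 4-byte    */
        else return 0;          /* stray continuation or unknown lead  */

        for (; need > 0; need--) {
            /* every following byte must look like 10xxxxxx */
            if (i >= len || (buf[i] & 0xC0) != 0x80)
                return 0;
            i++;
        }
    }
    return 1;
}
```

With this check, the original bytes "l\xc3\xb8\xc3\xb8k" pass, while the downgraded form "l\xf8\xf8k" is rejected, which matches the replacement characters in the last dump above.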

Warnocked?

cheers,
yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
