On 2/17/07, demerphq <demerphq gmail.com> wrote:
> On 2/17/07, via RT John Berthels <perlbug-followup perl.org> wrote:
> > # New Ticket Created by "John Berthels"
> > # Please include the string: [perl #41527]
> > # in the subject line of all future correspondence about this issue.
> > # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=41527 >
> >
> >
> > This is a bug report for perl from jjberthels gmail.com,
> > generated with the help of perlbug 1.35 running under perl v5.8.8.
> >
> >
> >
> > -----------------------------------------------------------------
> > [Please enter your report here]
> >
> > Hi.
> >
> > The documentation for the 'decode' function in Encode.pm states:
> >
> > ...the utf8 flag for $string is on unless $octets entirely
> > consists of ASCII data...
> >
> > but it appears that decode turns on the flag even if the input string is
> > plain ASCII. A test case demonstrating this is appended below.
> >
> > I understand this doesn't make a difference from a correctness point of
> > view, but it does change the performance characteristics, presumably due
> > to the use of the unicode regex engine (the profile showed something like
> > SWASHNEW taking a lot of time).
> >
> > An older version of Encode (I believe v 2.01) had the behaviour described
> > in the docs, and would have passed the test case below.
> >
> > In my case, the application is required to process utf8 data correctly,
> > but the vast majority of data is plain ascii. This change in behaviour from
> > 2.01 is causing a noticeable increase in CPU usage.
> >
> > I'm currently working around this with a regexp test /[\x80-\xff]/ on the
> > byte string and avoiding calling Encode::decode unless it matches, but a
> > quick check on perlmonks led to a suggestion that I raise this as a perlbug:
> > http://perlmonks.org/?node_id=600050 (although opinion was divided on
> > whether this was a bug).
> >
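
The workaround described above could be sketched roughly as follows. This is
only an illustration under stated assumptions: decode_if_needed is a
hypothetical helper name, not part of Encode, and the sample strings are made
up for the example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): call decode_utf8 only when the
# byte string contains at least one byte outside the 7-bit ASCII range.
sub decode_if_needed {
    my ($bytes) = @_;
    # Pure ASCII: return the byte string untouched, so no UTF8 flag is set.
    return $bytes unless $bytes =~ /[\x80-\xff]/;
    return Encode::decode_utf8($bytes);
}

my $plain = decode_if_needed("this is plain ascii");
my $wide  = decode_if_needed("l\xc3\xb8k");    # UTF-8 bytes for U+00F8 between l and k

print "plain: ", (Encode::is_utf8($plain) ? "flagged" : "unflagged"), "\n";
print "wide:  ", (Encode::is_utf8($wide)  ? "flagged" : "unflagged"), "\n";
```

The extra regexp pass costs one scan over the input, which is the trade-off
discussed further down (a second loop versus a flag passed back from the
decoder).
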
> > I've taken a quick look at the XS and can see an unconditional SvUTF8_on(dst)
> > on line 453. I don't know whether a good fix would be to add an additional
> > loop over the string to check the flag there or keep the 'only loop over
> > the string once' behaviour by passing a "was the string plain ascii" flag
> > back from process_utf8().
>
> I looked into more or less this strategy, but well, I'm not sure if it
> works out.
>
> I have to say the code in Encode.* is kinda confusing to this ascii
> type programmer.
>
> > I'll happily try to whip up a patch of either solution if you agree this
> > needs changing and let me know which approach you prefer.
> >
> > regards,
> >
> > jb
> >
> >
> > #!/usr/bin/perl
> > use warnings;
> > use strict;
> > use Test::More (tests => 2);
> >
> > use Encode;
> >
> > my $ascii_bytes = "l\xf8\xf8k - a latin1 string";
> > my $latin1_bytes = "this is plain ascii";
>
> Er, aren't these backwards? \xf8 isn't in ascii, it's in latin1. ascii is
> a 7 bit encoding.
>
> > my $encoded_str = Encode::decode_utf8($latin1_bytes);
> > ok(Encode::is_utf8($encoded_str),
> >    "(check encode is working) non-ascii latin-1 byte string becomes char str");
> >
> > $encoded_str = Encode::decode_utf8($ascii_bytes);
> > ok(! Encode::is_utf8($encoded_str),
> >    "but ascii byte string untagged after decode");
>
> I changed the code to the attached perl script, encode.pl, and I get
> the attached output with perl 5.8.6, Encode version 2.09:
>
> D:\dev\perl\ver\zoro\win32>perl encode.pl
> 1..2
> SV = PV(0x15d5914) at 0x1a6c864
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
> PV = 0x15dd674 "l\303\270\303\270k - a latin1 string" |