Thread: faster Unicode.xs decode_xs




faster Unicode.xs decode_xs
user name
2007-10-10 12:28:12
Here's a patch that speeds up reading of Unicode data via
Encode/Unicode's decode_xs() function. My test script [*], which
simply reads utf16le data, runs in roughly half the time.

It works by replacing the previous "grow the result buffer for each
character" method with allocating a maximum possible length buffer
to begin with. For that maximum possible length the patch uses
C< ((ulen + 1) * UTF8_MAXLEN) >, which is massively pessimistic,
i.e. a smaller estimate should be possible.
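
To illustrate the idea outside of XS, here is a minimal standalone C
sketch of the "size the buffer once, then write through a pointer"
approach. It is not the Encode code: malloc() stands in for SvGROW,
utf8_put() is a hypothetical stand-in for uvuni_to_utf8_flags(), and
SKETCH_UTF8_MAXLEN is an assumed constant picked for the sketch, not
perl's real UTF8_MAXLEN. Overflow of the size computation is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: stand-in for perl's UTF8_MAXLEN; the real value comes from perl's headers. */
#define SKETCH_UTF8_MAXLEN 4

/* Hypothetical helper: write one code point (<= 0x10FFFF) as UTF-8, return bytes written. */
static size_t utf8_put(uint8_t *d, uint32_t cp)
{
    if (cp < 0x80) {
        d[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        d[0] = (uint8_t)(0xC0 | (cp >> 6));
        d[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        d[0] = (uint8_t)(0xE0 | (cp >> 12));
        d[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        d[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    d[0] = (uint8_t)(0xF0 | (cp >> 18));
    d[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    d[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    d[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decode UTF-16LE into a malloc'd, NUL-terminated UTF-8 buffer.
 * The buffer is allocated once, before the loop, using the same shape
 * of estimate as the patch; the loop itself never reallocates. */
uint8_t *decode_utf16le_prealloc(const uint8_t *s, size_t ulen, size_t *out_len)
{
    const uint8_t *e = s + ulen;
    size_t cap = (ulen + 1) * SKETCH_UTF8_MAXLEN;
    uint8_t *root = malloc(cap);
    uint8_t *d = root;

    if (root == NULL)
        return NULL;

    while (e - s >= 2) {
        uint32_t ord = (uint32_t)s[0] | ((uint32_t)s[1] << 8);
        s += 2;
        /* Combine a high surrogate with a following low surrogate. */
        if (ord >= 0xD800 && ord <= 0xDBFF && e - s >= 2) {
            uint32_t lo = (uint32_t)s[0] | ((uint32_t)s[1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                ord = 0x10000 + ((ord - 0xD800) << 10) + (lo - 0xDC00);
                s += 2;
            }
        }
        /* Any surrogate still left is unpaired: substitute U+FFFD. */
        if (ord >= 0xD800 && ord <= 0xDFFF)
            ord = 0xFFFD;
        d += utf8_put(d, ord);
        assert((size_t)(d - root) < cap);   /* the up-front estimate must hold */
    }
    *d = '\0';
    *out_len = (size_t)(d - root);
    return root;
}
```

The caller owns the returned buffer and releases it with free(). The
point of the sketch is only that the per-character work shrinks to a
pointer bump once the allocation has been hoisted out of the loop.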

NB. this method assumes that decode_xs() is not called with an
arbitrary length input buffer, i.e. that we aren't potentially going
to be hit by the heavy-handed UTF8_MAXLEN multiplier above requiring
such a large allocation that the SvGROW might fail where the previous
code would not have. However, encoding.xs (always?) packages up the
to-be-decoded buffers in <=1024 byte chunks, so I think it's OK.
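
As a rough illustration of how much smaller the estimate could be,
the sketch below compares the patch's formula with a tighter bound.
Both function names are hypothetical, and the tighter bound rests on
assumptions that have not been checked against Unicode.xs: that the
decoder never emits a code point above U+10FFFF, and that anything it
substitutes for bad input is no longer than 3 UTF-8 bytes (as U+FFFD
is). Under those assumptions a 2-byte unit yields at most 3 output
bytes, and a 4-byte unit or surrogate pair yields at most 4.

```c
#include <assert.h>
#include <stddef.h>

/* The patch's estimate; utf8_maxlen is passed in rather than assumed. */
size_t patch_estimate(size_t ulen, size_t utf8_maxlen)
{
    return (ulen + 1) * utf8_maxlen;
}

/* Hypothetical tighter bound, in bytes including a trailing NUL.
 * size is the width of one input unit: 2 for UTF-16/UCS-2, 4 for UTF-32. */
size_t tighter_estimate(size_t ulen, size_t size)
{
    assert(size == 2 || size == 4);
    if (size == 2)
        return (ulen / 2) * 3 + 1;   /* at most 3 UTF-8 bytes per 16-bit unit */
    return (ulen / 4) * 4 + 1;       /* at most 4 UTF-8 bytes per 32-bit unit */
}
```

For a 1024-byte UTF-16 chunk, tighter_estimate(1024, 2) comes to 1537
bytes. If UTF8_MAXLEN were 13 (an assumed value; check utf8.h for the
perl in question), patch_estimate(1024, 13) would come to 13325.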

Here's the patch, which also includes the micro-optimisation of only
looking up the "renewed" and "ucs2" attributes if they are needed
(typically they are not).


# %<

--- Unicode.xs-orig	2007-10-10 12:07:35.294849100 +0100
+++ Unicode.xs	2007-10-10 16:42:45.175942200 +0100
@@ -96,12 +96,13 @@
 {
     U8 endian   = *((U8 *)SvPV_nolen(attr("endian", 6)));
     int size    =   SvIV(attr("size",   4));
-    int ucs2    = SvTRUE(attr("ucs2",   4));
-    int renewed = SvTRUE(attr("renewed",  7));
+    int ucs2    = -1; /* only needed in the event of surrogate pairs */
     SV *result  = newSVpvn("",0);
     STRLEN ulen;
     U8 *s = (U8 *)SvPVbyte(str,ulen);
     U8 *e = (U8 *)SvEND(str);
+    U8 *root;
+
     ST(0) = sv_2mortal(result);
     SvUTF8_on(result);
 
@@ -124,15 +125,22 @@
     }
 #if 1
     /* Update endian for next sequence */
-    if (renewed) {
+    if (SvTRUE(attr("renewed", 7))) {
         hv_store((HV *)SvRV(obj),"endian",6,newSVpv((char *)&endian,1),0);
     }
 #endif
     }
+
+    /* Preallocate the temporary result buffer to the maximum possible size */
+    root = (U8 *) SvGROW(result, ((ulen + 1) * UTF8_MAXLEN));
+
     while (s < e && s+size <= e) {
     UV ord = enc_unpack(aTHX_ &s,e,size,endian);
     U8 *d;
     if (issurrogate(ord)) {
+	if (ucs2 == -1) {
+	    ucs2 = SvTRUE(attr("ucs2", 4));
+	}
         if (ucs2 || size == 4) {
         if (check) {
             croak("%"SVf":no surrogates allowed %"UVxf,
@@ -191,8 +199,7 @@
         }
     }
 
-    d = (U8 *) SvGROW(result,SvCUR(result)+UTF8_MAXLEN+1);
-    d = uvuni_to_utf8_flags(d+SvCUR(result), ord, 0);
+    d = uvuni_to_utf8_flags(root+SvCUR(result), ord, 0);
     SvCUR_set(result,d - (U8 *)SvPVX(result));
     }
     if (s < e) {

# >%

I think a similar technique might be applicable to speed up
the encode_xs code as well.


[*] Here's the test script

# %<
use Time::HiRes;
my $f = 'utf16le.txt';
# create some UTF16
open F, ">:raw:perlio:encoding(utf16le)", $f
	or die "cannot write $f: $!\n";
for (1..1000) {
	print F "foo " x 20, "\n";
}
close F;
open F, "<:raw:perlio:encoding(utf16le)", $f
	or die "cannot open $f: $!\n";
my $start = Time::HiRes::time;
for (1..200) {
	seek F, 0, 0 or die "cannot seek: $!\n";
	while (<F>) {}
}
my $end = Time::HiRes::time;
print "Took: ", ($end - $start), "\n";
close F;
# >%


Also, I noticed that Encode/Unicode/Unicode.xs and Encode/Encode.xs
both seem to have had their indentation mangled in the last
submission. It doesn't look deliberate... it might be worth checking
how this happened.

Thanks for your time.

Cheers, alex.

Re: faster Unicode.xs decode_xs
user name
2007-10-10 12:46:57
Moin,

On Wednesday 10 October 2007 19:28:12 Davies, Alex wrote:
> Here's a patch that speeds up reading of Unicode data via
> Encode/Unicode's decode_xs() function. My test script [*], which
> simply reads utf16le data, runs in roughly half the time.
>
> It works by replacing the previous "grow the result buffer for each
> character" method with allocating a maximum possible length buffer
> to begin with. For that maximum possible length the patch uses
> C< ((ulen + 1) * UTF8_MAXLEN) >, which is massively pessimistic,
> i.e. a smaller estimate should be possible.
>
> NB. this method assumes that decode_xs() is not called with an
> arbitrary length input buffer, i.e. that we aren't potentially going
> to be hit by the heavy-handed UTF8_MAXLEN multiplier above requiring
> such a large allocation that the SvGROW might fail where the previous
> code would not have. However, encoding.xs (always?) packages up the
> to-be-decoded buffers in <=1024 byte chunks, so I think it's OK.
>
> Here's the patch, which also includes the micro-optimisation of only
> looking up the "renewed" and "ucs2" attributes if they are needed
> (typically they are not).

Cool work, I was about to look into that a few months ago, but got
sidetracked and lost interest.

One other idea I had is to speed up the very popular ISO-8859-1 =>
UTF-8 encoding by just using a table lookup, where you map the 256
input bytes directly to their 1- or 2-byte UTF-8 encodings in a
(long) table.
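
A minimal standalone C sketch of such a table lookup might look like
the following. It is only an illustration of the idea, not code from
perl or Encode: the names latin1_entry, latin1_table_init() and
latin1_to_utf8() are hypothetical, and the table is filled in at
startup here rather than written out as a long static initializer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One table slot per input byte: how many UTF-8 bytes it needs, and which. */
typedef struct {
    uint8_t len;        /* 1 or 2 */
    uint8_t bytes[2];
} latin1_entry;

static latin1_entry latin1_table[256];

/* Fill the 256-entry table once. Bytes below 0x80 map to themselves;
 * the rest become a two-byte sequence with lead byte 0xC2 or 0xC3. */
void latin1_table_init(void)
{
    unsigned c;
    for (c = 0; c < 256; c++) {
        if (c < 0x80) {
            latin1_table[c].len = 1;
            latin1_table[c].bytes[0] = (uint8_t)c;
            latin1_table[c].bytes[1] = 0;
        } else {
            latin1_table[c].len = 2;
            latin1_table[c].bytes[0] = (uint8_t)(0xC0 | (c >> 6));
            latin1_table[c].bytes[1] = (uint8_t)(0x80 | (c & 0x3F));
        }
    }
}

/* Convert n ISO-8859-1 bytes to UTF-8 using only table lookups.
 * dst must have room for at least 2 * n bytes. Returns bytes written. */
size_t latin1_to_utf8(const uint8_t *src, size_t n, uint8_t *dst)
{
    uint8_t *d = dst;
    size_t i;
    for (i = 0; i < n; i++) {
        const latin1_entry *t = &latin1_table[src[i]];
        d[0] = t->bytes[0];
        if (t->len == 2)
            d[1] = t->bytes[1];
        d += t->len;
    }
    return (size_t)(d - dst);
}
```

Whether this actually beats computing the two bytes with shifts and
masks would need measuring; the sketch only shows the shape of the
lookup.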

This conversion is always triggered if you, for instance, compare an
ISO-8859-1 string to a UTF-8 string. (Maybe the conversion can even
be avoided in cases where you know you have X UTF-8 characters but Y
ISO-8859-1 bytes and X != Y; I don't know whether the code does the
conversion or not.)

Maybe this gives you some ideas.

All the best,

Tels

-- 
 Signed on Wed Oct 10 19:44:14 2007 with key 0x93B84C15.
 View my photo gallery: http://bloodgate.com/photos
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Moreover, it would now be possible to wiretap not only perpetrators
 but also victims, in order to protect them better."

  -- Jörg Bode, FDP (http://heise.de/newsticker/data/anw-11.12.03-003/)