Thread: Smack!




Smack!
user name
2007-04-18 06:34:18
Right.  Nobody else saw it.  That's what I thought, but I wanted to
give it an overnight before I let the other shoe fall.

The bug is that this module won't work in a program that
has a "use encoding" pragma in it.

    % perl -Mencoding=utf8 -MSmack -e 'smack && snarf && print "hurray"'
    Wide character in print at Smack.pm line 50.
    Wide character in print at Smack.pm line 51.
    bindata is size 24 (should be 20)
    Wide character in $/ at Smack.pm line 63.
    Exit 255

It needs a "use bytes" in Smack.pm for it to work correctly.

That's because otherwise the non-scoped use encoding reaches into the
module and alters how Perl deals with data.  The encoding::warnings
pragma can be useful for diagnosing this, but one shouldn't have to.
It's a bug.  I was hoping that use encoding had become correctly
scoped, but it hasn't.

Now, in this simple example, which can be written out this way:

    use encoding 'utf8';
    use Smack;
    smack() && snarf() && print "hurray!\n";

one need but move the use encoding to after the use Smack:

    use Smack;
    use encoding 'utf8';
    smack() && snarf() && print "hurray!\n";

Also in this simple example, if you put "use encoding::warnings" into
the Smack.pm module, it diagnoses and cures the problem.

However, in more elaborate scenarios, neither of those works.
For example:

    #!/usr/bin/perl 
    use strict;
    use warnings;
    use Image::ExifTool qw(ImageInfo);
    use encoding 'utf8';  # comes after, but still screwed

    if (!@ARGV) {
	die "usage: $0 filename ...\n";
    }

    for my $filename (@ARGV) {
	my $info = ImageInfo($filename);

	if (my $error = $info->{Error}) {
	    warn "Can't parse image info on file $filename: $error\n";
	    next;
	} 

	if (my $oops = $info->{Warning}) {
	    warn "WARNING: Can't parse image info on file $filename: $oops\n";
	    # fallthrough
	}

	printf "%s is size %s\n", $filename, $info->{ImageSize};
    }

Run that on a JPEG file and it will bomb if there's a 

    use encoding 'utf8';

in the main program, even one that comes after the module load.
You have to put "use bytes" into the Image::ExifTool module for
it to work.  What's happening is that its

    "\xff" . chr($marker)

code (amongst other places) is being shamelessly "promoted" into
a Unicode-encoded string, starting from an assumed ISO-8859-1.
This will produce now a 4-byte string that is "\357\277\275".
This is of course completely nuts.
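
A minimal sketch of where those three octets can come from, using only
Encode and no "use encoding" at all.  It assumes the lone "\xff" octet
ends up being read as UTF-8; the marker value 0xD8 is a hypothetical
stand-in for $marker, not taken from ExifTool.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $marker = 0xD8;                      # hypothetical marker value
my $raw    = "\xff";                    # one octet, not valid UTF-8 on its own

# With the default CHECK, malformed input is replaced by U+FFFD.
my $chars  = decode('UTF-8', $raw);

# Writing that character back out as UTF-8 takes three octets.
my $octets = encode('UTF-8', $chars) . chr($marker);

printf "code point %d\n", ord $chars;
printf "octets: %s\n",
    join ' ', map { sprintf '\\%03o', $_ } unpack 'C*', $octets;
```

The two-octet marker has become four octets, and the original 0xFF
cannot be recovered from the result.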

The bug IMHO is that use encoding is not lexically scoped.  It affects
code everywhere.  You should not count on what its documentation says
about "no encoding" doing you much good, nor on placing the "use
encoding" after the modules are sucked in.  That's not always good
enough.  It isn't here.
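
For contrast, here is a small sketch of what lexical scoping looks like
with a pragma that does have it.  It uses "use bytes" purely as an
illustration of scoping; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # one character, three octets as UTF-8

my $chars_outside = length $smiley;                   # character semantics
my $octets_inside = do { use bytes; length $smiley }; # pragma active in this block only
my $chars_after   = length $smiley;                   # character semantics again

print "$chars_outside $octets_inside $chars_after\n";
```

The pragma's effect ends at the closing brace of the block that
contains it, and code outside that block never sees it.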

Audrey's "encoding::warnings" pragma will find these for you.  Add
"use encoding::warnings" instead of "use bytes" to the ExifTool
module, and you find this:

Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 748
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 749
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2178
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2179
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2317
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2320
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2320
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2489
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2522
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2529
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2579
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2607
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2608
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2658

I believe one might be able to do something at all those points to get
it to behave better involving calls to specific encode or decode
routines from Encode, but it's easiest to just say use bytes and be
done.  But you should not have to do this!
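
As a rough sketch of the Encode route, and not something tested against
ExifTool or under "use encoding", the following simulates the implicit
Latin-1 upgrade with utf8::upgrade and then forces the string back to
plain octets.  The marker value 0xD8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $marker   = 0xD8;                    # hypothetical marker value
my $upgraded = "\xff" . chr($marker);
utf8::upgrade($upgraded);               # simulate the implicit Latin-1 upgrade

# Map each character 0..255 back to exactly one octet.
my $octets = encode('iso-8859-1', $upgraded);

printf "upgraded is flagged: %s\n", utf8::is_utf8($upgraded) ? "yes" : "no";
printf "octets is flagged:   %s\n", utf8::is_utf8($octets)   ? "yes" : "no";
printf "octet values: %s\n", join ' ', unpack 'C*', $octets;
```

This only round-trips when every character is below 256.  Once a byte
has already been replaced by U+FFFD, no encode call will bring it back.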

Now, it's not *always* enough to just place a use bytes in your own
module code.  For example:

    # Module BadEnc.pm
    use bytes;
    my $i = 0;
    sub main::func { return "\xff" . chr($i) }
    1;

BTW, you get a different answer writing

    sub main::func { return "\xff" . chr(0) }

than the chr($i), because it gets optimized into what is effectively

    sub main::func { return "\xff\x00" }

and so gets encoded up differently in the implicit conversion.

You can then run this:

    use BadEnc;
    use encoding "utf8";
    $word = func();
    for $i ( 0 .. (bytes::length($word)-1) ) {
	printf "char #%d has code point %d\n", $i, bytes::ord(bytes::substr($word, $i, 1));
    }
    for $i ( 0 .. (length($word)-1) ) {
	printf "char #%d has code point %d\n", $i, ord(substr($word, $i, 1));
    }
    print $word;

And you'll see that you still have a problem:

    char #0 has code point 255
    char #1 has code point 0
    char #0 has code point 65533
    char #1 has code point 0

I've omitted the last line of output, but it is what gave me the
string that I ran through "od -c" to find that it's "\357\277\275".

BTW, placing use encoding::warnings into BadEnc.pm gives this now as
output:

    Bytes implicitly upgraded into wide characters as iso-8859-1 at BadEnc.pm line 4
    char #0 has code point 195
    char #1 has code point 191
    char #2 has code point 0
    char #0 has code point 255
    char #1 has code point 0

Which shows you would still have a problem.  A different problem, but
still a problem.
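
The 195/191 pair in that output is what a Latin-1 upgrade of "\xff"
looks like at the octet level.  A minimal sketch that reproduces those
numbers without "use encoding", using utf8::upgrade as a stand-in for
the implicit upgrade the warning describes:

```perl
use strict;
use warnings;

my $word = "\xff" . chr(0);
utf8::upgrade($word);                   # stand-in for the implicit upgrade

# Character view: the string still has two characters.
my @code_points = map { ord } split //, $word;

# Octet view: copy the string and expose its UTF-8 octets.
my $copy = $word;
utf8::encode($copy);
my @octets = unpack 'C*', $copy;

print "code points: @code_points\n";
print "octets:      @octets\n";
```

U+00FF takes two octets in UTF-8, which is why the byte-level loop
reports three entries while the character-level loop reports two.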

I did test the "use bytes" in Image/ExifTool.pm, and it makes you
suddenly able to read JPEGs correctly again.  So the module author has
to do that.  And so does anyone, I guess.  That sucks.  This program
needs to know that when it uses a string like "\xFF\xFD" and jumps
around a binary file, it doesn't have its data mutilated on it.  That's
a disaster.

Supposedly there are plans to make "use encoding" a properly lexically
scoped pragma in 5.9, but I don't know how far along that is.

I *did* leave clues: the strange record separator, the strange way I
constructed it, my reference to Audrey, and speaking pragmatically.  It
shows that not even perl5-porters are sensitized to this issue.  Since
you are not, it's not all that reasonable to expect all module writers
to be sensitive to it.

--tom
