Hi All,
I've come across a problem in writing encoding layers
derived from Encode::Encoding and think this either
needs fixing or documenting in Encode::Encoding.
Basically, when passing Unicode data through to encode(),
it is easy to get "Malformed UTF-8" warnings, because a
multibyte UTF-8 character can be split in the middle at the
boundary of the 1024-byte buffer. I was able to fix this in
the code I was writing by setting $| = 1 on the filehandle,
which forces full lines to be passed through to the encode()
calls. This solved my problem (as all my lines were <1024
bytes). I then reread the docs and noticed the needs_lines()
setting, which would also have fixed the case where my code
was breaking.
However, if you *do* have a line whose byte length is >1024,
then the string is still cut off at that point even if
needs_lines is set to 1, thus risking splitting a multibyte
character. This also means that an encoding with
C<sub needs_lines { 1 }> is susceptible to (1) not getting
complete lines, and (2) not getting complete strings (the
final bytes of a multibyte character could be missing, or
the first few bytes could be the remaining bytes of the
previously chopped multibyte character).
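To make the arithmetic concrete, here is a small standalone sketch that builds the same data with Encode::encode and looks at the bytes on either side of offset 1024. It assumes the layer cuts at exact multiples of 1024 bytes; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# 51 two-byte characters followed by 1001 three-byte characters.
my $chars = join '', map { chr } 130 .. 180, 21000 .. 22000;
my $bytes = encode('UTF-8', $chars);
my $total = length $bytes;    # byte count, since $bytes is an octet string

# Last byte of the first 1024-byte chunk and first byte of the second.
my $last_of_first = ord substr($bytes, 1023, 1);
my $first_of_next = ord substr($bytes, 1024, 1);

# A three-byte lead byte has the bit pattern 1110xxxx,
# a continuation byte has the bit pattern 10xxxxxx.
my $ends_on_lead      = ($last_of_first & 0xF0) == 0xE0;
my $starts_mid_char   = ($first_of_next & 0xC0) == 0x80;

print STDERR "total bytes=$total\n";
print STDERR "chunk 1 ends on a lead byte\n"          if $ends_on_lead;
print STDERR "chunk 2 starts on a continuation byte\n" if $starts_mid_char;
```

The two-byte characters occupy the first 102 bytes, so the three-byte characters start at offset 102 and a cut at 1024 falls one byte into one of them.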
Here's some code to demonstrate the "Malformed UTF-8" warning:
# %<
package Encode::Ident;
use warnings;
use strict;
use base qw(Encode::Encoding);
__PACKAGE__->Define('ident');
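use bytes ();    # load bytes::length() without enabling byte semantics
sub needs_lines { 1 }    # ask PerlIO to pass whole lines to encode()
sub encode {
    my ($obj, $str, $chk) = @_;
    my $result = $str;
    my $byte_len = bytes::length($str);
    # XXX calling length() below may trigger "Malformed UTF-8":
    my $str_len = length $str;
    print STDERR "bytes::length=$byte_len\tlength=$str_len\n";
    $_[1] = '' if $chk; # this is what in-place edit means
    return $result;
}
sub decode {    # identity as well; not exercised by this demonstration
    my ($obj, $buf, $chk) = @_;
    my $result = $buf;
    $_[1] = '' if $chk;
    return $result;
}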
##
package main;
use warnings;
use strict;
my $tmp_file = 'deme.tmp';
#END { unlink $tmp_file };
open FOUT, ">", $tmp_file or die "open: $!";
binmode FOUT, ':encoding(Ident)' or die "binmode: $!";
if (0) { # mimic needs_lines
select( (select(FOUT), $| = 1)[0] );
}
# A long (>1024 bytes) string of multibyte UTF-8 characters:
my $long_uni = join '', map chr, 130..180, 21000..22000;
print STDERR "Data length=", length($long_uni), "\n";
print STDERR "Data bytes::length=", bytes::length($long_uni), "\n\n";
# Pass the string through to Encode::Ident::encode()
print FOUT $long_uni, "\n";
close FOUT;
# >%
This gives me:
Data length=1052
Data bytes::length=3105
Malformed UTF-8 character (unexpected end of string) at
D:srcdevexeencoding_bug.pl line 15.
bytes::length=1024 length=358
Malformed UTF-8 character (unexpected end of string) at
D:srcdevexeencoding_bug.pl line 15.
bytes::length=1024 length=342
bytes::length=1024 length=342
bytes::length=34 length=12
It would be preferable if C<encode()> could be guaranteed
to receive a well-formed string; but if this is not
feasible, then it should be mentioned in the docs for
Encode::Encoding, perhaps with a suggested method
(e.g. Encode::CN::HZ::encode) of rebuilding the malformed
parts into a well-formed UTF-8 string.
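As a rough idea of what such a suggested method might look like, here is a minimal sketch that splits a chunk of raw UTF-8 bytes into the part that ends on a character boundary and the incomplete tail to carry over to the next chunk. The helper name split_incomplete_tail is hypothetical, it works on a plain octet string rather than on the flagged string that encode() actually receives, and I am not claiming this is how Encode::CN::HZ does it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: returns (complete_prefix, incomplete_tail).
sub split_incomplete_tail {
    my ($bytes) = @_;
    my $len = length $bytes;

    # Walk back over at most three continuation bytes (10xxxxxx)
    # to find the last byte that could start a character.
    my $i     = $len - 1;
    my $steps = 0;
    while ($i >= 0 && $steps < 3
           && (ord(substr($bytes, $i, 1)) & 0xC0) == 0x80) {
        $i--;
        $steps++;
    }
    return ($bytes, '') if $i < 0;

    # How many bytes does the character starting at $i need?
    my $lead = ord(substr($bytes, $i, 1));
    my $need = $lead >= 0xF0 ? 4
             : $lead >= 0xE0 ? 3
             : $lead >= 0xC0 ? 2
             :                 1;
    my $have = $len - $i;

    # Nothing to hold back if the last character is complete
    # (or the input is invalid in some other way).
    return ($bytes, '') if $have >= $need;
    return (substr($bytes, 0, $i), substr($bytes, $i));
}

# Usage: the first 1024-byte chunk of the data from the demonstration.
my $bytes = encode('UTF-8', join '', map { chr } 130 .. 180, 21000 .. 22000);
my ($head, $tail) = split_incomplete_tail(substr($bytes, 0, 1024));

# $head now decodes cleanly; $tail would be prepended to the next chunk.
my $text = decode('UTF-8', $head, Encode::FB_CROAK | Encode::LEAVE_SRC);
print STDERR "head=", length($head), " tail=", length($tail),
             " chars=", length($text), "\n";
```

An encoding would keep the tail between calls and prepend it to the next chunk before processing, so no character is ever seen in two halves.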
Many thanks,
alex.