Thread: Encode::Encoding and "Malformed UTF-8 character" warnings

user name
2007-02-26 12:12:41
Hi All,

I've come across a problem in writing encoding layers
derived from Encode::Encoding and think this either
needs fixing or documenting in Encode::Encoding.

Basically, when passing Unicode data through to an encode(),
it is easy to get "Malformed UTF-8" warnings when a
multibyte UTF-8 character is split in the middle by the
1024-byte buffer. I was able to fix this in the code I was
writing by setting $| = 1 on the channel, which forces full
lines to be passed through to the encode() calls. This solved
my problem (as all my lines were <1024 bytes). I then reread
the docs and noticed the needs_lines() setting, which would
also have fixed the case where my code was breaking.

However, if you *do* have a line whose byte length is >1024,
then the string is still cut off at that point even if
needs_lines is set to 1, thus risking splitting a
multibyte character. This also means that an encoding
with C<sub needs_lines { 1 }> is susceptible to
(1) not getting complete lines, and (2) not getting complete
strings (the final bytes of a multibyte character could be
missing, or the first few bytes could be the remaining bytes
of the previously chopped multibyte character).
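
A minimal standalone sketch of such a split, separate from the failing script: it cuts a UTF-8 byte string at 1024 bytes and asks decode() how much of the chunk is well formed. The character U+5208 and the 400 copies are arbitrary choices for illustration; the 1024-byte cut only mimics the buffer size described above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+5208 occupies three bytes in UTF-8, so 400 copies are 1200 bytes.
my $text  = chr(0x5208) x 400;
my $bytes = encode('UTF-8', $text);

# Cut at 1024 bytes, which is not a multiple of three.
my $chunk = substr($bytes, 0, 1024);

# With FB_QUIET, decode() returns what it could process and leaves
# the unprocessed bytes in its (modified) argument.
my $rest    = $chunk;
my $decoded = decode('UTF-8', $rest, Encode::FB_QUIET);

printf "chunk bytes=%d decoded chars=%d leftover bytes=%d\n",
	length($chunk), length($decoded), length($rest);
```

Since 1024 = 341 * 3 + 1, the last byte of the chunk should be the lead byte of a character whose remaining two bytes fall into the next chunk.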


Here's some code to demonstrate the "Malformed UTF-8" warning:

# %<
package Encode::Ident;
use warnings;
use strict;
use base qw(Encode::Encoding);

__PACKAGE__->Define('ident');

sub needs_lines { 1 };

sub encode {
	my ($obj, $str, $chk) = @_;
	my $result = $str;
	my $byte_len = bytes::length $str;
	# XXX calling length() below may trigger "Malformed UTF-8":
	my $str_len = length $str;
	print STDERR "bytes::length=$byte_len  \tlength=$str_len\n";
	$_[1] = '' if $chk; # this is what in-place edit means
	return $result;
}

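# NOTE: the body of decode() was lost in the archived copy; an identity
# mirror of encode() is assumed here. It is not called by this demo.
sub decode {
	my ($obj, $buf, $chk) = @_;
	my $result = $buf;
	$_[1] = '' if $chk;
	return $result;
}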

##

package main;

use warnings;
use strict;
use bytes (); # make bytes::length() available without enabling byte semantics

my $tmp_file = 'deme.tmp';

#END { unlink $tmp_file };

open FOUT, ">", $tmp_file or die "open: $!";

binmode FOUT, ':encoding(Ident)' or die "binmode: $!";

if (0) { # mimic needs_lines
	select( (select(FOUT), $| = 1)[0] );
}

# A long (>1024 bytes) string of multibyte UTF-8 characters:
my $long_uni = join '', map chr, 130..180, 21000..22000;

print STDERR "Data length=", length($long_uni), "\n";
print STDERR "Data bytes::length=", bytes::length($long_uni), "\n\n";

# Pass the string through to Encode::Ident::encode()
print FOUT $long_uni, "\n";

close FOUT;
# >%

This gives me:

Data length=1052
Data bytes::length=3105

Malformed UTF-8 character (unexpected end of string) at D:srcdevexeencoding_bug.pl line 15.
bytes::length=1024  	length=358
Malformed UTF-8 character (unexpected end of string) at D:srcdevexeencoding_bug.pl line 15.
bytes::length=1024  	length=342
bytes::length=1024  	length=342
bytes::length=34  	length=12
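
As a cross-check on the byte counts above, here is a small arithmetic sketch (not part of the demo script). It assumes only what the script shows: code points 130..180 take two bytes each in UTF-8, code points 21000..22000 take three, and a single "\n" is appended on output.

```perl
use strict;
use warnings;

my $two_byte   = 180 - 130 + 1;      # number of code points in 130..180
my $three_byte = 22000 - 21000 + 1;  # number of code points in 21000..22000

# Total bytes handed to encode(), including the trailing "\n".
my $total = 2 * $two_byte + 3 * $three_byte + 1;

printf "total=%d full_buffers=%d remainder=%d\n",
	$total, int($total / 1024), $total % 1024;
```

This is consistent with three calls of 1024 bytes followed by a final call of 34 bytes. Neither 1024 nor 2048 lands on a character boundary here, which matches the two warnings.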

-

It would be preferable if C<encode()> could be guaranteed
to receive a well-formed string; but if this is not
feasible then this should be mentioned in the docs
for Encode::Encoding, with perhaps a suggested method
(e.g. Encode::CN::HZ::encode) of rebuilding the malformed
parts into a well-formed UTF-8 string.
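
One possible shape for such a rebuilding step, as a rough sketch only: hold back the bytes of an incomplete trailing character and prepend them to the next chunk. The helper make_chunk_decoder is hypothetical (it is not part of Encode, and it is not how Encode::CN::HZ does it), and it assumes each chunk is available as raw bytes, which inside a real encode() would need an extra step that is not shown here.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: returns a closure that accepts raw UTF-8 byte
# chunks and returns only the complete characters seen so far.
sub make_chunk_decoder {
	my $pending = '';   # bytes of a character split by the previous chunk
	return sub {
		my ($chunk) = @_;
		my $buf = $pending . $chunk;
		# FB_QUIET: decode the well-formed prefix, leave the rest in $buf.
		my $chars = decode('UTF-8', $buf, Encode::FB_QUIET);
		$pending = $buf;
		return $chars;
	};
}

# Same test data as the demo script, fed through in 1024-byte chunks.
my $text  = join '', map { chr } 130 .. 180, 21000 .. 22000;
my $bytes = encode('UTF-8', $text);

my $feed    = make_chunk_decoder();
my $rebuilt = '';
for (my $pos = 0; $pos < length $bytes; $pos += 1024) {
	$rebuilt .= $feed->(substr $bytes, $pos, 1024);
}

print $rebuilt eq $text ? "round trip ok\n" : "round trip failed\n";
```

A caveat with this approach: FB_QUIET also stops at genuinely invalid bytes, so input that is malformed for reasons other than chunking would pile up in $pending instead of being reported.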

Many thanks,
alex.
