Thread: UTF-8 and detecting encoding.




UTF-8 and detecting encoding.
user name
2006-01-25 15:40:39
Just trying to get my head wrapped around the utf8 issues.

I've got utf8 content in Postgresql and generating utf8 output.
That's all working fine.

I have other cases where content is pulled from disk.  I understand
that if the content has a BOM then it will be correctly read by TT.
I'm just passing an open file handle to process(), so I'm not reading
in the content before passing off to TT.

I'm seeing some content that's cp1252, though.  So, I'm wondering: how
can I still pass in a file handle, but have the content decoded
correctly?

-- 
Bill "utf8-challenged" Moseley
moseleyhank.org


_______________________________________________
templates mailing list
templatestemplate-toolkit.org
http://lists.template-toolkit.org/mailman/listinfo/templates
UTF-8 and detecting encoding.
user name
2006-01-25 20:37:54
Bill,

This is not a direct answer to your question, but my
Template::Provider::Encoding module will help your situation. When you
use it with Stash::ForceUTF8, you don't have to care about UTF-8
auto-upgrading problem.
http://search.cpan.org/~miyagawa/Template-Provider-Encoding-0.03/

On 1/25/06, Bill Moseley <moseleyhank.org> wrote:
> Just trying to get my head wrapped around the utf8 issues.
>
> I've got utf8 content in Postgresql and generating utf8 output.
> That's all working fine.
>
> I have other cases where content is pulled from disk.  I understand
> that if the content has a BOM then it will be correctly read by TT.
> I'm just passing an open file handle to process(), so I'm not reading
> in the content before passing off to TT.
>
> I'm seeing some content that's cp1252, though.  So, I'm wondering: how
> can I still pass in a file handle, but have the content decoded
> correctly?


--
Tatsuhiko Miyagawa

UTF-8 and detecting encoding.
user name
2006-01-26 01:04:42
On Wed, Jan 25, 2006 at 12:37:54PM -0800, Tatsuhiko Miyagawa wrote:
>
> This is not a direct answer to your question, but my
> Template::Provider::Encoding module will help your situation. When you
> use it with Stash::ForceUTF8, you don't have to care about UTF-8
> auto-upgrading problem.

Thanks.  I like this idea as it forces the encoding to be defined in
the templates.  Might be nice to specify something other than utf8 as
the default encoding if no encoding is specified in the template,
though.

Since Template::Provider::Encoding calls Template::Provider::_load
should it not check for the utf8 flag before trying to decode it?
If a template had a BOM then returned data would already be decoded.

The other option would be to set UNICODE => 0, but that would not
handle the case of a scalar being passed in that was already utf8.


Now, if there was a module that prevented people from pasting from MS
Word.



Few comments/questions about TT's handling of encoding.  Please
correct me if I'm wrong about anything.


Template::Provider will attempt to determine the encoding by BOM for
templates supplied by file name or a handle.  Scalar refs are not
touched, so they need to be correctly decoded before being passed to
process().

This BOM detection happens automatically for perl > 5.007.
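
To make the BOM step concrete, here is a minimal sketch in plain Perl using only the core Encode module. It is not TT's actual code: decode_by_bom is a made-up helper name, and the BOM table is simply the standard set of Unicode byte order marks.

```perl
use strict;
use warnings;
use Encode qw(decode);

# BOM byte sequences and the encodings they indicate. The UTF-32 entries
# come first because the UTF-32LE BOM starts with the UTF-16LE BOM.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# decode_by_bom is a hypothetical helper, not part of TT.
# Returns (decoded text, encoding name) when a BOM is found,
# or (the untouched bytes, undef) when there is none.
sub decode_by_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($bom, $encoding) = @$entry;
        if (index($bytes, $bom) == 0) {
            # Strip the BOM, then decode what follows it.
            my $body = substr($bytes, length $bom);
            return (decode($encoding, $body), $encoding);
        }
    }
    return ($bytes, undef);
}

# A UTF-8 BOM followed by the UTF-8 bytes for "cafe" with an acute e.
my ($text, $enc) = decode_by_bom("\xEF\xBB\xBFcaf\xC3\xA9");
```

Without a BOM the bytes come back untouched, which is the "left alone" case discussed further down.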

There's a "UNICODE" option to the provider.  Thus, this feature can be
disabled.  It seems that this option is not documented currently (in
my quick grep).

Obviously, you need an editor or some way to write the BOM to all the
template files to use this feature.


Now:

- If a BOM is not found then the text is left alone.  It might be nice
to specify a default encoding so that if no BOM is found then the
text is still decoded instead of left as raw data.

So, in my case I could specify cp1252 and if UTF8 is not detected by
BOM then it is assumed that it's 1252 and then converted to a perl
string.

- I also wonder if _decode_unicode should just return if the input
text is already flagged as utf8.  This would be useful when supplying
a file handle that already has a PerlIO Layer set.  Currently if you
pass in a file handle with <:utf8 set you will get:

  Cannot decode string with wide characters at /usr/lib/perl/5.8/Encode.pm line 166, <$fh> chunk 1.

if the file also contains a BOM.
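
The two suggestions above could be sketched roughly like this. It is only an illustration under stated assumptions: decode_template_text is a made-up helper and not TT's _decode_unicode, the default-encoding argument is not an existing TT option, and only the UTF-8 BOM is checked to keep it short.

```perl
use strict;
use warnings;
use Encode qw(decode);

# decode_template_text is a hypothetical helper, not TT's _decode_unicode,
# and the default-encoding argument is not an existing TT option.
sub decode_template_text {
    my ($text, $default_encoding) = @_;

    # Second bullet: text that already carries the utf8 flag (for example,
    # read through a :utf8 PerlIO layer) holds characters, so do not decode
    # it again. A BOM read that way survives as U+FEFF; drop it.
    if (utf8::is_utf8($text)) {
        $text =~ s/^\x{FEFF}//;
        return $text;
    }

    # A UTF-8 BOM wins over the default encoding.
    my $utf8_bom = "\xEF\xBB\xBF";
    if (index($text, $utf8_bom) == 0) {
        my $body = substr($text, length $utf8_bom);
        return decode('UTF-8', $body);
    }

    # First bullet: no BOM, so assume the caller's default encoding
    # instead of leaving the bytes undecoded.
    return decode($default_encoding, $text);
}

# Bytes 0x93 and 0x94 are the cp1252 curly quotes that Word tends to produce.
my $from_word = decode_template_text("\x93hi\x94", 'cp1252');
my $with_bom  = decode_template_text("\xEF\xBB\xBFhi", 'cp1252');
my $flagged   = decode_template_text("\x{FEFF}\x{263A}", 'cp1252');
```

With the early return in place, already-decoded input never reaches decode(), so the "wide characters" error cannot be triggered by a handle that has a layer set.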



Oh, BTW.  Isn't this supposed to be correct according to the IO::File
docs?

    $ perl -MIO::File -le "IO::File->new('utf8.html', 'r')->binmode(':utf8')"
    usage $fh->binmode([LAYER]) at -e line 1

This works, though:

    binmode($fh, ':utf8')
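
Another way to sidestep the method call, as a sketch only: IO::File hands a mode string containing a colon to three-argument open, so the layer can be given at open time. The temp file and its content are just for illustration, and ':encoding(UTF-8)' is used here as the stricter cousin of ':utf8'.

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw(tempfile);

# Write a small UTF-8 file to read back; the content is illustrative only.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':raw');
print {$out} "caf\xC3\xA9\n";
close $out or die "close failed: $!";

# A mode containing ':' makes IO::File pass it to three-argument open,
# so the layer is applied as the file is opened.
my $fh = IO::File->new($path, '<:encoding(UTF-8)')
    or die "cannot open $path: $!";
my $line = <$fh>;
chomp $line;
$fh->close;
```

Here $line holds four characters, not five bytes, so the layer has already done the decoding by the time the text is read.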



-- 
Bill Moseley
moseleyhank.org


UTF-8 and detecting encoding.
user name
2006-01-26 18:42:15
On 1/25/06, Bill Moseley <moseleyhank.org> wrote:

> > This is not a direct answer to your question, but my
> > Template::Provider::Encoding module will help your situation. When you
> > use it with Stash::ForceUTF8, you don't have to care about UTF-8
> > auto-upgrading problem.
>
> Thanks.  I like this idea as it forces the encoding to be defined in
> the templates.  Might be nice to specify something other than utf8 as
> the default encoding if no encoding is specified in the template,
> though.

There's a very similar module on CPAN to do that:
http://search.cpan.org/dist/Template-Provider-Encode/

> Since Template::Provider::Encoding calls Template::Provider::_load
> should it not check for the utf8 flag before trying to decode it?
> If a template had a BOM then returned data would already be decoded.

Good point. This should be fixed in the next release. Personally I never
use BOM in the templates.


--
Tatsuhiko Miyagawa
