Thread: Encoding patch for Provider




Encoding patch for Provider
user name
2006-01-31 15:48:14
Andy,

Since the patches are flying, here's something from the list
discussion a week or so ago.

Should Provider test for utf8-ness before attempting to check for BOM?
If you pass in a file handle that's already opened with a layer then
you get warnings.
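
To illustrate the question: a string read through a handle that was opened with an encoding layer already carries Perl's internal UTF-8 flag, so it holds characters rather than bytes by the time a BOM check would see it. A minimal sketch using only core modules; the sample data and variable names are made up for illustration and are not taken from Provider:

```perl
use strict;
use warnings;
use Encode ();

# Sample template source as raw UTF-8 bytes ("cafe" with an e-acute).
my $bytes = "caf\xC3\xA9\n";

# Read it back through an in-memory handle, once raw and once with a layer.
open my $raw_fh, '<:raw', \$bytes or die "open failed: $!";
my $raw = <$raw_fh>;
close $raw_fh;

open my $utf8_fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $decoded = <$utf8_fh>;
close $utf8_fh;

# The raw read is still bytes; the layered read is already characters.
printf "raw:     is_utf8=%d length=%d\n",
    (Encode::is_utf8($raw) ? 1 : 0), length $raw;
printf "decoded: is_utf8=%d length=%d\n",
    (Encode::is_utf8($decoded) ? 1 : 0), length $decoded;
```

Checking Encode::is_utf8() first is one way to tell the two cases apart before looking for a BOM.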

And would it be smart to have a TEMPLATE_ENCODING config option to
tell Provider how to decode the content if no BOM is found?

An unanswered question is should TT also encode output, or should
setting the output io layer be left to the user?

If you want I can try and find time for docs and test case, although
the decode() I added only generates warnings (CHECK not set), and I'm
not sure how to get Template::Test to check for warnings.  That's
something you are probably more familiar with.
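
For reference, the CHECK argument to Encode::decode() decides what happens on malformed input. The sketch below contrasts omitting CHECK with passing FB_CROAK. The sample bytes are hypothetical, and the sketch does not rely on whether a warning is printed in the default mode:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical input: Latin-1 bytes that are not valid UTF-8.
my $bytes = "caf\xE9";

# CHECK omitted: the malformed byte is replaced with U+FFFD and
# decode() returns normally.
my $lenient = decode('UTF-8', $bytes);

# CHECK set to FB_CROAK: malformed input raises an exception instead.
# LEAVE_SRC keeps decode() from modifying $bytes in place.
my $strict = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
my $error  = $@;

printf "lenient result contains U+FFFD: %s\n",
    ($lenient =~ /\x{FFFD}/ ? 'yes' : 'no');
printf "strict decode died: %s\n", ($error ? 'yes' : 'no');
```

A test could therefore either trap the exception from a strict decode or inspect the result for the substitution character.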


I also think something like Tatsuhiko Miyagawa's T::P::Encoding
module would be nice core feature since specifying the encoding of a
template is rather fundamental.  Could it be part of META, I wonder?

    [% META encoding = 'utf8' %]

although I suppose the encoding needs to be determined before META is
parsed.

First pass, something like:


cvs server: Diffing lib/Template
Index: lib/Template/Provider.pm
===================================================================
RCS file: /template-toolkit/Template2/lib/Template/Provider.pm,v
retrieving revision 2.86
diff -u -B -r2.86 Provider.pm
--- lib/Template/Provider.pm    2006/01/30 20:04:54     2.86
+++ lib/Template/Provider.pm    2006/01/31 15:37:48
@@ -405,6 +405,7 @@
     $self-> = $params->;
 #   $self-> = $params->;
     $self-> = $params;
+    $self->{ TEMPLATE_ENCODING } = $params->{ TEMPLATE_ENCODING };
 
     # look for user-provided UNICODE parameter or use default from package var
     $self->{ UNICODE } = defined $params->{ UNICODE }
@@ -1008,6 +1009,8 @@
     my $self   = shift;
     my $string = shift;
 
+    return $string if Encode::is_utf8( $string );
+
     # try all the BOMs in order looking for one (order is important
     # 32bit BOMs look like 16bit BOMs)
     my $count = 0;
@@ -1023,8 +1026,9 @@
         }
     }
 
-    # no boms matched so it must be a non unicode string which we return as is
-    return $string;
+    return $self->{ TEMPLATE_ENCODING }
+        ? Encode::decode( $self->{ TEMPLATE_ENCODING }, $string )
+        : $string;
 }
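
The decoding order the patch aims for (already-decoded strings pass through, then a BOM check, then the configured default) can be sketched outside Provider as a standalone function. This is a sketch under assumptions, not the Provider code: decode_template() and the @BOMS table are hypothetical names, and the default encoding is passed as an argument rather than read from the object.

```perl
use strict;
use warnings;
use Encode ();

# BOM table, 32-bit marks first so a UTF-32LE BOM is not mistaken
# for a UTF-16LE one.
my @BOMS = (
    [ 'UTF-32BE' => "\x00\x00\xFE\xFF" ],
    [ 'UTF-32LE' => "\xFF\xFE\x00\x00" ],
    [ 'UTF-8'    => "\xEF\xBB\xBF"     ],
    [ 'UTF-16BE' => "\xFE\xFF"         ],
    [ 'UTF-16LE' => "\xFF\xFE"         ],
);

# Hypothetical stand-in for the patched Provider method.
sub decode_template {
    my ($string, $default_encoding) = @_;

    # Already characters (e.g. read through an I/O layer): leave alone.
    return $string if Encode::is_utf8($string);

    # Look for a BOM; if found, strip it and decode with that encoding.
    for my $bom (@BOMS) {
        my ($encoding, $mark) = @$bom;
        if (substr($string, 0, length $mark) eq $mark) {
            my $body = substr($string, length $mark);
            return Encode::decode($encoding, $body);
        }
    }

    # No BOM: fall back to the configured default, if any.
    return $default_encoding
        ? Encode::decode($default_encoding, $string)
        : $string;
}

my $with_bom = decode_template("\xEF\xBB\xBFcaf\xC3\xA9");
my $latin1   = decode_template("caf\xE9", 'iso-8859-1');
my $as_is    = decode_template("plain ascii");

printf "%vd\n", $_ for $with_bom, $latin1;   # code points of each result
print "$as_is\n";
```

Both decoded results hold the same four characters, which is the point of the fallback: a BOM-less Latin-1 template and a UTF-8 template with a BOM end up as the same character string.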
 
 

-- 
Bill Moseley
moseleyhank.org


_______________________________________________
templates mailing list
templatestemplate-toolkit.org
http://lists.template-toolkit.org/mailman/listinfo/templates
Encoding patch for Provider
user name
2006-02-01 08:51:05
Bill Moseley wrote:
> And would it be smart to have a TEMPLATE_ENCODING config option to
> tell Provider how to decode the content if no BOM is found?

Hi Bill, 

Yes, this is a good idea.

> An unanswered question is should TT also encode output, or should
> setting the output io layer be left to the user?

We can probably have one ENCODING config option to specify the default
encoding for both input and output templates.

If we know the specific encoding of the source template (either by BOM or
some kind of in-template flag) then we can also set the output encoding.

> If you want I can try and find time for docs and test case, although
> the decode() I added only generates warnings (CHECK not set), and I'm
> not sure how to get Template::Test to check for warnings.  That's
> something you are probably more familiar with.

I've applied your patch, although as ENCODING rather than TEMPLATE_ENCODING.
Tests and docs would be great.

To catch warnings I usually do something like this:

  my $warning;
  local $SIG{__WARN__} = sub {
      $warning = shift;
  };

  my $vars = {
    warning => sub { return $warning },
  };

  test_expect(*DATA, undef, $vars);

  __DATA__
 
  -- test --
  action: [% do_something('wrong') or 'failed' %]
  warning: [% warning %]
  -- expect --
  action: failed
  warning: do_something() does not accept a 'wrong' argument
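
The same capture technique can be tried on its own, without Template::Test. In this minimal sketch the warn() call is a hypothetical stand-in for whatever code is under test:

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, shift };

    # Hypothetical stand-in for the code under test.
    warn "do_something() does not accept a 'wrong' argument\n";
}

# Outside the block the previous handler is back in place.
print "captured ", scalar(@warnings), " warning(s): $warnings[0]";
```

Because the handler is installed with local, it only applies inside the enclosing block.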
  
> I also think something like Tatsuhiko Miyagawa's T::P::Encoding
> module would be nice core feature since specifying the encoding of a
> template is rather fundamental.

Yes, I agree.  But the ad-hoc way of specifying the encoding isn't
reliable enough.

> Could it be part of META, I wonder?
> 
>     [% META encoding = 'utf8' %]

That's preferable, but as you point out, we need to determine the encoding
before we start scanning the content.  I suppose we could assume ASCII
until we detect an encoding META tag, then decode the content, then start
parsing again.
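
One way to picture the "assume ASCII, find the tag, then decode" idea is a pre-scan of the raw bytes before any parsing happens. This is only a sketch of the approach under discussion, not anything TT implements: decode_by_meta() is a hypothetical helper, and the regular expression is a rough approximation of the META syntax.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: look for an ASCII-only META encoding tag in the
# raw bytes, then decode the whole template with the declared encoding.
sub decode_by_meta {
    my ($bytes) = @_;

    # The tag itself is pure ASCII, so it can be matched before decoding,
    # provided the encoding is ASCII-compatible (not UTF-16 or UTF-32).
    if ($bytes =~ /\[%-?\s*META\s+encoding\s*=\s*(['"])([\w.:-]+)\1/) {
        my $encoding = $2;
        return (Encode::decode($encoding, $bytes), $encoding);
    }

    # No declaration found: hand the bytes back untouched.
    return ($bytes, undef);
}

my $source = "[% META encoding = 'iso-8859-1' %]caf\xE9";
my ($text, $encoding) = decode_by_meta($source);

print "declared encoding: $encoding\n";
printf "last character: U+%04X\n", ord substr($text, -1);
```

The limitation noted in the comment is the reason a BOM check would still have to come first: a UTF-16 template cannot be scanned as ASCII.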

For this release, I think we'll offer the ENCODING option and try and solve
the in-template encoding specification problem another day.

Cheers
A



Encoding patch for Provider
user name
2006-02-01 09:57:07
Bill Moseley wrote:
 > >
 > > An unanswered question is should TT also encode output, or should
 > > setting the output io layer be left to the user?

TT already has a 4th option to the 'process' method, an options hash.
This takes a 'binmode' key, that can define the output IO layer. eg

    $tt->process( $templatefile, $vars, $output, { binmode => ':utf8' } );

will encode the output to utf-8 before writing. Andy - as regards your
reply, I think it would be a bad idea to assume the same input and
output template character sets. But that's just me.
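
The point that input and output character sets are independent can be shown with core Perl alone: decode with one encoding on the way in, and let a layer on the output handle encode with another on the way out. A minimal sketch with hypothetical sample data and an in-memory output handle, not using TT itself:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical Latin-1 template source, decoded on the way in.
my $input_bytes = "caf\xE9\n";
my $text        = Encode::decode('iso-8859-1', $input_bytes);

# Write the same text out as UTF-8 by putting a layer on the output handle.
my $output_bytes = '';
open my $out, '>', \$output_bytes or die "open failed: $!";
binmode $out, ':encoding(UTF-8)';
print {$out} $text;
close $out;

printf "input:  %d bytes\n", length $input_bytes;
printf "output: %d bytes\n", length $output_bytes;
```

The byte counts differ because the same character is one byte in Latin-1 and two in UTF-8, so neither setting can be inferred from the other.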


tom


Encoding patch for Provider
user name
2006-02-01 12:00:15
Tom Insam wrote:
> will encode the output to utf-8 before writing. Andy - as regards your
> reply, I think it would be a bad idea to assume the same input and
> output template character sets. But that's just me.

Having looked it over, I agree.

It's a more complex issue than I first thought.  I've changed ENCODING
to only specify the default encoding for input templates.  It doesn't
have any effect on the output encoding - you still have to do that
via the fourth process() argument.

I'll try and address the whole encoding issue more rigorously for TT3.

A


Encoding patch for Provider
user name
2006-02-01 17:09:18
On Wed, Feb 01, 2006 at 12:00:15PM +0000, Andy Wardley wrote:
> Tom Insam wrote:
> > will encode the output to utf-8 before writing. Andy - as regards your
> > reply, I think it would be a bad idea to assume the same input and
> > output template character sets. But that's just me.
> 
> Having looked it over, I agree.
> 
> It's a more complex issue than I first thought.  I've changed ENCODING
> to only specify the default encoding for input templates.  It doesn't
> have any effect on the output encoding - you still have to do that
> via the fourth process() argument.

I first started with "DEFAULT_ENCODING" and then changed to
"TEMPLATE_ENCODING", but, it's really "DEFAULT_TEMPLATE_ENCODING"
because it's after the BOM check.  But, I figured that big option name
would force too many people to have to move all their "=>" over in their
nice, neat, lined-up config option lists.




-- 
Bill Moseley
moseleyhank.org

