Thread: UTF-8 and detecting encoding.




UTF-8 and detecting encoding.
user name
2006-01-25 16:51:31
On Wed, Jan 25, 2006 at 04:35:30PM +0000, Chisel Wright wrote:
> On Wed, Jan 25, 2006 at 08:25:22AM -0800, Bill Moseley wrote:
> > > Hopefully I'll learn something here one way or the other, but
> > > isn't this what the binmode argument to process() is for?
> > > 
> > >  $tt->process($template, $data, $outfile, binmode => ':utf8')
> > 
> > IIRC, that's only for files created by TT.
> 
> grr, TT is one of those lists that doesn't reply-to list, isn't it?

I didn't notice either.  Back to the list now.

 
> OK, so you're opening other files, I thought open had a binmode too.
> 
> http://perl.enstimac.fr/perl5.8.5/5.8.5/open.html ?
> 
> Of course, you should be using IO::File, which also appears to have a
> binmode option.

I'm using IO::File, but the issue is I'm not sure of the encoding
when I open the file.

I first assumed everything was 8859-1 on disk, then I noticed that
quite a few files were probably copy-n-pasted from Word or something
and were cp1252.

I suppose I can just assume everything is cp1252 (includes 8859-1??),
but what happens if a file is actually utf8 on disk?
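
To make the worry concrete, here is a minimal sketch (an illustration
only, not something from the actual code base) of what decoding utf8
bytes as cp1252 does. The sample string and variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample data (hypothetical): "cafe" with an e-acute, stored as UTF-8 bytes.
my $octets = encode('UTF-8', "caf\x{e9}");    # bytes 63 61 66 C3 A9

# Wrong guess: cp1252 treats C3 and A9 as two separate characters.
my $wrong = decode('cp1252', $octets);

# Right guess: the two bytes decode to one character.
my $right = decode('UTF-8', $octets);

printf "%d %d\n", length $wrong, length $right;    # prints "5 4"
```

So nothing fails loudly; each multi-byte character just turns into two
or three cp1252 characters.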

-- 
Bill Moseley
moseleyhank.org


_______________________________________________
templates mailing list
templatestemplate-toolkit.org
http://lists.template-toolkit.org/mailman/listinfo/templates
UTF-8 and detecting encoding.
user name
2006-01-25 19:59:01
Hi Bill,

We experienced this problem when we first switched to utf8,
and we went ahead and converted all web files from cp1252 to
utf8 to normalize everything. Most of our input is via the
website and not actual file uploads, so we were pretty safe
after that. If you have actual file uploads and such you may
need a cron job to automatically convert uploaded files.

You might want to look into Lingua::DetectCharset. I believe it can
help you detect whether a file is already utf8.

We simply looped over all our files, and if that module said
a file was utf8, we skipped it; otherwise we would encode
it from cp1252 to utf8 like so:

Encode::from_to($file_contents, "CP1252", "utf8");

(Making a backup and all that jazz.)

I'd suggest keeping the backups around for a while until
you're sure things are ok.
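
Roughly, the loop could look like the sketch below. This is not our
actual script: the helper names (is_valid_utf8, convert_file) and the
.bak suffix are made up for illustration, and it swaps the
detection-module check for a strict UTF-8 decode from Encode, which is
an assumption on my part about what works well enough for cp1252 data.

```perl
use strict;
use warnings;
use Encode ();
use File::Copy qw(copy);

# Hypothetical helper: true if the octets are well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # decode() with a CHECK value may modify its input
    return defined eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
}

# Hypothetical helper: convert one file in place, keeping a .bak copy.
# Returns 1 if the file was converted, 0 if it was skipped.
sub convert_file {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $contents = do { local $/; <$in> };
    close $in;
    $contents = '' unless defined $contents;

    return 0 if is_valid_utf8($contents);    # already utf8 (or plain ASCII)

    copy($path, "$path.bak") or die "Cannot back up $path: $!";
    Encode::from_to($contents, 'cp1252', 'utf8');

    open my $out, '>:raw', $path or die "Cannot write $path: $!";
    print {$out} $contents;
    close $out or die "Cannot close $path: $!";
    return 1;
}

convert_file($_) for @ARGV;
```

Running it a second time over the same files should be harmless, since
converted files then pass the UTF-8 check and are skipped.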

If time is no object then I suppose you could just
check/convert on the fly each time.
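
A sketch of the on-the-fly variant, since you're using IO::File: read
the raw bytes, try a strict UTF-8 decode, and fall back to cp1252 if
that fails. The helper name slurp_text is hypothetical, and the
"valid UTF-8 means it is UTF-8" rule is an assumption that could
misfire on unusual cp1252 text.

```perl
use strict;
use warnings;
use IO::File;
use Encode qw(decode);

# Hypothetical helper: return a file's contents as decoded text.
sub slurp_text {
    my ($path) = @_;
    my $fh = IO::File->new($path, '<') or die "Cannot open $path: $!";
    $fh->binmode(':raw');                    # read bytes, no layer guessing
    my $octets = do { local $/; <$fh> };
    $fh->close;
    $octets = '' unless defined $octets;

    my $copy = $octets;    # decode() with a CHECK value may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

    # Not well-formed UTF-8: assume cp1252.
    return defined $text ? $text : decode('cp1252', $octets);
}
```

The decoded text can then go to TT, with the binmode => ':utf8' option
on process() taking care of the output side.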

-- Josh

_______________________________________________
templates mailing list
templatestemplate-toolkit.org
http://lists.template-toolkit.org/mailman/listinfo/t
emplates
[1-2]

about | contact  Other archives ( Real Estate discussion Medical topics )