Thread: RE: Mixing character sets
|
|
| RE: Mixing character sets |

|
2008-03-06 21:46:39 |
> 5) However, when using
> '$burner->set_encoding('encoding(iso-8859-1)');', that curly quote
> should be converted from UTF-8 to ISO-8859-1 during burning and thus
> display properly.
Well, that seems to be identifying the correct (or at least an
acceptable) Unicode character for the quote that was pasted in from
Word...
http://theorem.ca/~mvcorks/cgi-bin/unicode.pl.cgi?start=2000&end=206F
but I believe the problem is that '\x{...}' is not a syntax that web
browsers understand as representing a Unicode character, which is why
I had previously asked whether set_encoding() also results in the HTML
entity being generated. If it generated the HTML entity syntax for “
(for example &#8220;) instead, it would work.
Also, this might be worth a quick read:
http://en.wikipedia.org/wiki/Unicode_and_HTML#HTML_document_characters
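A minimal sketch of that idea, assuming Perl's core Encode module: when a character has no ISO-8859-1 equivalent, Encode can be asked to emit a decimal HTML character reference in its place. Whether set_encoding() in Bricolage can be made to behave this way is not established in this thread, and latin1_with_entities is a hypothetical helper used only for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode a character string as ISO-8859-1, replacing
# anything Latin-1 cannot represent with a decimal HTML character reference.
sub latin1_with_entities {
    my ($text) = @_;   # work on a copy; encode() may modify its input when CHECK is set
    return Encode::encode('iso-8859-1', $text, Encode::FB_HTMLCREF());
}

my $story = "\x{201c}Hello\x{201d}";   # curly quotes as pasted from Word
print latin1_with_entities($story), "\n";
```

For the curly-quoted string above this should print &#8220;Hello&#8221;, which a browser renders as curly quotes regardless of the page's declared character set.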
|
|
| RE: Mixing character sets |

|
2008-03-06 22:11:45 |
David P., thanks for your tips.
After spending hours researching this, I think I might have a
potential explanation, but no solution.
I think the issue is with characters, such as smart quotes, that come
from Microsoft Word. According to Wikipedia:
<snip>
Word processors have traditionally offered curved quotes to users,
because in printed documents curved quotes are preferred to straight
ones. Before Unicode was widely accepted and supported, this meant
representing the curved quotes in whatever 8-bit encoding the software
and underlying operating system
<http://en.wikipedia.org/wiki/Operating_system> were using - but the
character sets for Windows
<http://en.wikipedia.org/wiki/Microsoft_Windows> and Macintosh
<http://en.wikipedia.org/wiki/Apple_Macintosh> used two different
pairs of values for curved quotes, and ISO 8859-1
<http://en.wikipedia.org/wiki/ISO_8859-1> (typically the default
character set for the Unices <http://en.wikipedia.org/wiki/Unix> and,
until recently, Linux <http://en.wikipedia.org/wiki/Linux>) has no
curved quotes, making cross-platform compatibility a nightmare.
Compounding the problem is the "smart quotes" feature mentioned above,
which some word processors (including Microsoft Word and
OpenOffice.org <http://en.wikipedia.org/wiki/OpenOffice.org>) use by
default. With this feature turned on, users may not have realised that
the ASCII-compatible straight quotes they were typing on their
keyboards ended up as something entirely different.
</snip>
Source: http://en.wikipedia.org/wiki/Smart_quotes#Quotation_marks_in_electronic_documents
And according to this list of ANSI characters not in ISO-8859-1:
http://www.alanwood.net/demos/charsetdiffs.html#a
the characters in question are not part of ISO-8859-1, and I believe
them to be part of Windows-1252 (CP1252). Thus, I'm guessing that when
converted to ISO-8859-1 in Bricolage, there is no match, so the
Unicode representation is returned in the format '\x{...}'. Does this
make sense to y'all?
However, this is not usable to me, so how do I convert '\x{...}' to
something useful?
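The guess above can be checked outside Bricolage with Perl's core Encode module. In this sketch the left curly quote has a CP1252 byte but no ISO-8859-1 byte, and Encode's FB_PERLQQ fallback writes the unmappable character as a backslash-x escape. That Bricolage's output path uses this particular fallback is an assumption, not something confirmed in this thread.

```perl
use strict;
use warnings;
use Encode ();

my $quote = "\x{201c}";   # LEFT DOUBLE QUOTATION MARK

# CP1252 has a slot for this character (byte 0x93) ...
my $cp1252 = Encode::encode('cp1252', $quote);
printf "cp1252:     0x%02X\n", ord $cp1252;

# ... but ISO-8859-1 does not, so the fallback decides what is emitted.
my $copy   = $quote;   # encode() may modify its input when a CHECK value is passed
my $latin1 = Encode::encode('iso-8859-1', $copy, Encode::FB_PERLQQ());
print "iso-8859-1: $latin1\n";
```

The second line of output should be literal text of the form \x{201c}, which is a Perl string escape and means nothing to a browser.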
I've now read more than I've ever wanted or expected to about Unicode,
character sets and character encodings ... and I'm still confused.
Sigh.
Chris
P.S. Apologies for the numerous posts.
|
|
| Re: Mixing character sets |

|
2008-03-10 17:37:21 |
On Mar 6, 2008, at 20:11, Schults, Chris wrote:
> And according to this list of ANSI characters not in ISO-8859-1:
>
> http://www.alanwood.net/demos/charsetdiffs.html#a
Yes, and this is why I wrote Encode::ZapCP1252: to convert those bogus
characters to ASCII. I need to update it to optionally convert them to
UTF-8.
> The characters in question are not part of ISO-8859-1, and I believe
> them to be part of Windows-1252 (CP1252). Thus, I'm guessing that
> when converted to ISO-8859-1 in Bricolage, there is no match, so the
> Unicode representation is returned in the format '\x{...}'. Does
> this make sense to y'all?
No, because Bricolage expects UTF-8 to be submitted to the browser,
and it stores the data as UTF-8. So it never converts from CP-1252 to
ISO-8859-1. It converts from CP-1252 to UTF-8, and then later from
UTF-8 to ISO-8859-1. Of course, it only takes that first step if
you've set your character set preference in Bricolage to CP-1252.
Ah-ha! That's the bit I've been trying to remember for how we've
recommended handling this issue in the past. Try changing your
character set preference, then create a new story and paste from Word,
and then try to preview it with a template that calls
'$burner->set_encoding('encoding(iso-8859-1)');' and see if it doesn't
properly come out as ISO-8859-1. That should work!
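To see why the character set preference matters for that first step, here is a small sketch using Perl's core Encode module. It interprets the same submitted bytes under two different character sets and shows the UTF-8 that would end up stored. The helper store_as_utf8 and the sample bytes are hypothetical; this illustrates the principle rather than Bricolage's actual input handling.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: interpret submitted bytes in the given character set
# and return the UTF-8 bytes that would be stored.
sub store_as_utf8 {
    my ($charset, $bytes) = @_;
    my $chars = Encode::decode($charset, $bytes);
    return Encode::encode('UTF-8', $chars);
}

my $submitted = "\x93Hi\x94";   # CP1252 bytes for a curly-quoted "Hi"

for my $charset ('cp1252', 'iso-8859-1') {
    my $stored = store_as_utf8($charset, $submitted);
    my $hex    = join ' ', map { sprintf '%02X', ord($_) } split //, $stored;
    printf "%-10s -> %s\n", $charset, $hex;
}
```

Read as CP1252, byte 0x93 becomes U+201C and is stored as E2 80 9C. Read as ISO-8859-1, the same byte becomes the control character U+0093 and is stored as C2 93, so the curly quote is lost before any later conversion runs.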
Of course, the only thing I cannot understand is why you continue to
get "\x{...}", which is a UTF-8 character.
> However, this is not usable to me, so how do I convert '\x{...}' to
> something useful?
I'm a little confused. Are you seeing a curly quote and calling it
"\x{...}" (which is how you can represent it in a Perl double-quoted
string), or are you seeing the literal string "\x{...}"?
> I've now read more than I've ever wanted or expected to about
> Unicode, character sets and character encodings ... and I'm still
> confused. Sigh.
It's all good stuff to know, and will pay off in the long run, believe
me.
Best,
David
|
|
| RE: Mixing character sets |

|
2008-03-10 18:14:39 |
> Yes, and this is why I wrote Encode::ZapCP1252: to convert those
> bogus characters to ASCII. I need to update it to optionally convert
> them to UTF-8.
Ah, that's cool, but instead of approximations, I'm converting to HTML
entities from /autohandler.mc:
<%filter>
# replace Microsoft-1252 characters
encode_entities($_, '€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜š›œžŸ');
</%filter>
% $burner->set_encoding('encoding(iso-8859-1)');
% $burner->chain_next;
This appears to be working just fine.
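For readers who want to see what such a filter does to a string, here is a rough standalone sketch. It swaps in a plain regex substitution in place of HTML::Entities so that it runs with core Perl only, and it covers just a partial, illustrative list of the CP1252 extras. The names $cp1252_extras and entity_encode_extras are hypothetical.

```perl
use strict;
use warnings;

# Some of the characters CP1252 adds in 0x80-0x9F, as Unicode code points:
# euro, curly single and double quotes, bullet, en and em dash, ellipsis, TM.
my $cp1252_extras =
    "\x{20AC}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{2026}\x{2122}";

# Hypothetical stand-in for the filter body: replace each listed character
# with a decimal character reference and leave everything else alone.
sub entity_encode_extras {
    my ($html) = @_;
    $html =~ s/([$cp1252_extras])/sprintf('&#%d;', ord($1))/ge;
    return $html;
}

print entity_encode_extras("\x{201C}Hi\x{201D} \x{2014} 5\x{20AC}"), "\n";
```

Because the replacements are pure ASCII, the result survives the later conversion to ISO-8859-1 unchanged.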
> I'm a little confused. Are you seeing a curly quote and calling it
> "\x{...}" (which is how you can represent it in a Perl double-quoted
> string), or are you seeing the literal string "\x{...}"?
Oh, sorry if I wasn't clear. I'm seeing the literal string.
Chris
|
|
| Re: Mixing character sets |

|
2008-03-11 14:58:37 |
On Mar 10, 2008, at 16:14, Schults, Chris wrote:
> Ah, that's cool, but instead of approximations, I'm converting to
> HTML entities from /autohandler.mc:
>
> <%filter>
> # replace Microsoft-1252 characters
> encode_entities($_, '€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜š›œžŸ');
> </%filter>
> % $burner->set_encoding('encoding(iso-8859-1)');
> % $burner->chain_next;
>
> This appears to be working just fine.
>
>> I'm a little confused. Are you seeing a curly quote and calling it
>> "\x{...}" (which is how you can represent it in a Perl double-quoted
>> string), or are you seeing the literal string "\x{...}"?
>
> Oh, sorry if I wasn't clear. I'm seeing the literal string.
You're seeing the characters "\x{...}" show up in your output? That's
just bizarre.
David
|
|
| RE: Mixing character sets |

|
2008-03-11 15:24:32 |
<snip>
You're seeing the characters "\x{...}" show up in your output? That's
just bizarre.
</snip>
Not anymore since I started using encode_entities for these
unsafe characters, but prior to this, yes.
Chris
|
|
| Re: Mixing character sets |

|
2008-03-12 17:25:56 |
On Mar 11, 2008, at 13:24, Schults, Chris wrote:
> <snip>
> You're seeing the characters "\x{...}" show up in your output?
> That's just bizarre.
> </snip>
>
> Not anymore since I started using encode_entities for these unsafe
> characters, but prior to this, yes.
Huh. I wonder what was generating those? That's just wrong. It sounds
like a dumper module or something.
David
|
|
|
|