Thread: Re: UnicodeString conversion truncation




Switzerland
2007-10-22 06:37:19
> The cleanest interface for the JSON parser/serializer would be to
> receive and produce UnicodeStrings and let the programmer worry about
> encoding.

I see.  An alternative, when you read a "\uXXXX" escape, is to just
return Strings.  To add a UnicodeCharacter to a String stream, you just
use

    aStream display: aCharacter
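
As a rough illustration of the idea (in Python rather than Smalltalk, so
none of these names are GNU Smalltalk API), appending the character named
by a "\uXXXX" escape to a byte-oriented stream amounts to encoding it in
the stream's encoding before writing.  The helper name append_escape is
hypothetical, and surrogate pairs are deliberately not handled:

```python
import io


def append_escape(out, escape, encoding="utf-8"):
    """Append the character named by a JSON escape like \\u00e9 to a byte stream.

    Hypothetical helper for illustration only; it does not combine
    surrogate pairs, which a real JSON parser would have to do.
    """
    if len(escape) != 6 or not escape.startswith("\\u"):
        raise ValueError("expected an escape of the form \\uXXXX")
    code_point = int(escape[2:], 16)
    # The stream holds bytes, so the character is encoded on the way in.
    out.write(chr(code_point).encode(encoding))


out = io.BytesIO()
out.write(b"caf")
append_escape(out, "\\u00e9")          # U+00E9, "e" with acute accent
utf8_result = out.getvalue()           # b"caf\xc3\xa9"

latin1_out = io.BytesIO()
latin1_out.write(b"caf")
append_escape(latin1_out, "\\u00e9", encoding="latin-1")
latin1_result = latin1_out.getvalue()  # b"caf\xe9"
```

The same escape yields different bytes depending on the stream's
encoding, which is why the stream needs to know its encoding.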

A full implementation would probably require adding a method like this:

     PositionableStream >> encoding
         ^collection encoding

and I can take care of a more complete implementation of Stream encoding.

There are many ways to specify encoding, for example the following:

1) add a #on:encoding: constructor where the encoding defaults to
'UTF-8'.  When creating a String to be returned, use the same encoding
as the input.

2) use the aforementioned PositionableStream >> encoding method; when
creating a String to be returned, use the same encoding as the input.

3) use the aforementioned PositionableStream >> encoding method and add
a #on:outputEncoding: constructor, where the encoding defaults to the
same encoding as the input.

4) use the aforementioned PositionableStream >> encoding method and
always return UnicodeStrings.  In this case, you will never find
Characters whose value is >= 128 in the input (you'll find
UnicodeCharacters instead!).
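
To make the difference between options 1 and 3 concrete, here is a
minimal sketch in Python (not Smalltalk; the class name JsonReader and
its parameters are hypothetical and only mirror the #on:encoding: and
#on:outputEncoding: ideas).  The input encoding defaults to UTF-8 and the
output encoding defaults to whatever the input uses:

```python
import json


class JsonReader:
    """Hypothetical reader that returns encoded byte strings, not Unicode text."""

    def __init__(self, data, encoding="utf-8", output_encoding=None):
        self.data = data
        self.encoding = encoding
        # As in option 3: the output encoding defaults to the input encoding.
        self.output_encoding = output_encoding or encoding

    def read_string(self):
        # Decode the input bytes, let the JSON parser resolve \uXXXX escapes,
        # then re-encode the result for the caller.
        text = json.loads(self.data.decode(self.encoding))
        if not isinstance(text, str):
            raise TypeError("this sketch only handles a top-level JSON string")
        return text.encode(self.output_encoding)


source = b'"caf\\u00e9"'   # the JSON text "caf\u00e9", pure ASCII on the wire

same_as_input = JsonReader(source).read_string()
as_latin1 = JsonReader(source, output_encoding="latin-1").read_string()
latin1_input = JsonReader(b'"caf\xe9"', encoding="latin-1").read_string()
```

Option 4 would correspond to returning the decoded text itself and
leaving any re-encoding to the caller.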

> Hm, I agree that hashing Strings in their UTF-8 encoded form is a good
> approximation.  Which will of course horribly break if someone chooses
> to use e.g. German "umlaute" in the source code in latin-1 encoding,
> or maybe not.  How is the encoding of a literal string determined?

It is not determined so far, and unless you are interested in using
Strings and UnicodeStrings interchangeably for hashing, you should not
care.  Do you have examples of prior art in other languages?
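
For reference, the approximation being discussed can be sketched as
follows, again in Python and with a hypothetical function name
(approx_hash): Unicode text is hashed through its UTF-8 bytes, while byte
strings are hashed as-is on the assumption that they are already UTF-8.

```python
import zlib


def approx_hash(value):
    """Hash text via its UTF-8 bytes; hash byte strings as-is.

    Hypothetical sketch: byte strings are assumed, not checked, to be UTF-8.
    """
    data = value if isinstance(value, bytes) else value.encode("utf-8")
    return zlib.crc32(data)


text = "Gr\u00fc\u00dfe"               # "Gruesse" written with u-umlaut and sharp s
utf8_bytes = text.encode("utf-8")      # b"Gr\xc3\xbc\xc3\x9fe"
latin1_bytes = text.encode("latin-1")  # b"Gr\xfc\xdfe"

# Same text, same hash, as long as the byte string really is UTF-8.
hashes_agree = approx_hash(text) == approx_hash(utf8_bytes)

# A Latin-1 literal holds different bytes for the same text, so it is
# hashed as if it were a different string.
bytes_differ = latin1_bytes != utf8_bytes
```

This is the case where the approximation breaks: a Latin-1 encoded
literal and the equivalent Unicode text are no longer interchangeable as
hash keys.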

>> ASCII characters and UTF-8 please.  I'm also from a Latin-1 country,
>> but I try to think as internationally as possible.
>
> That Smalltalk source code literals come in UTF-8 encoded form is a bold
> assumption (which is increasingly right these days on Linux and other OSs

Yes.

Paolo


_______________________________________________
help-smalltalk mailing list
help-smalltalk@gnu.org

http://lists.gnu.org/mailman/listinfo/help-smalltalk
