Thread: UnicodeString conversion truncation




2007-10-22 03:57:10
>    String>>#jsonPrintOn:
>       (self anySatisfy: [ :ch | ch value between: 128 and: 255 ])
>              ifTrue: [ self asUnicodeString jsonPrintOn: aStream ]
>              ifFalse: [ super jsonPrintOn: aStream ]
> 
> Why print strings that have non-ascii chars differently?

Because, say, a UTF-8-encoded string containing the characters 195 and
160 should print as "\u00E0", not as "à" (that's a lowercase accented
'a').  The easiest way to convert the two bytes to a single character is
with #asUnicodeString, because in GNU Smalltalk Strings are bytes and
UnicodeStrings are characters.

Actually, to support ISO-2022-JP and similar encodings (which use a
sequence introduced by ESC to switch between Latin and double-byte
characters), one of us should probably change jsonPrintOn: to use

     (self allSatisfy: [ :ch | ch value between: 32 and: 126 ])
         ifFalse: [ self asUnicodeString jsonPrintOn: aStream ]
         ifTrue: [ super jsonPrintOn: aStream ]

(A note you can safely skip: even this, unfortunately, would not cater
for UTF-7.  You can skip it because UTF-7 is terminally broken, and all
you should do with UTF-7 input is convert it to a saner encoding as soon
as you read it.)
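
To see why the 32..126 test matters for ISO-2022-JP, here is a minimal
sketch in GNU Smalltalk.  It builds an ISO-2022-JP-style string by hand;
the variable names are made up for illustration, and the two bytes after
the escape sequence merely stand in for one double-byte character.

```smalltalk
"Sketch only: compare the two tests on a hand-built 7-bit string."
| esc jis highBitTest printableTest |
esc := Character value: 27.

"ESC $ B switches to JIS X 0208.  Every element is below 128."
jis := (String with: esc with: $$ with: $B) , '0l'.

highBitTest := jis anySatisfy: [ :ch | ch value between: 128 and: 255 ].
printableTest := jis allSatisfy: [ :ch | ch value between: 32 and: 126 ].

highBitTest printNl.      "false: the 128..255 test sees nothing unusual"
printableTest printNl.    "false: ESC (27) fails the 32..126 test"
```

With the first test the string would be printed as if it were plain
ASCII; with the second it is routed through #asUnicodeString.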

> And this in the string parsing code:
> 
>             c = $u
>                ifTrue: [
>         c := (Integer readFrom: (stream next: 4) readStream radix: 16) asCharacter.
>         (c class == UnicodeCharacter and: [ str species == String ])
>           ifTrue: [ str := (UnicodeString new writeStream
>                nextPutAll: str contents; yourself) ] ].
>          ].
>       str nextPut: c.

What it does now is operate on UnicodeStrings only if it considers it
necessary; if there are no \uXXXX escapes, it uses String because valid
JSON only has 7-bit characters in strings.
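
The decoding step can be sketched in isolation.  This assumes the same
Integer class >> readFrom:radix: call that the quoted parser code uses;
the variable names are hypothetical.

```smalltalk
"Sketch: decode the four hex digits that follow \u in a JSON string."
| digits code |
digits := '00E0'.
code := Integer readFrom: digits readStream radix: 16.
code printNl.                    "224, i.e. 16rE0"

"The class of the resulting character decides whether a String suffices."
code asCharacter class printNl.
```

If the class printed on the last line is UnicodeCharacter, the parser
has to switch its output stream to a UnicodeString, which is what the
quoted test on "c class" does.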

> Would you object if I change the json code to operate on UnicodeStrings only?

I would like to understand why you need this, but no, I would not
object, especially because I consider JSON your code, not mine.  I just
helped a bit.

I think you wouldn't be able to operate on UnicodeStrings only, unless I
fix the bug with String/UnicodeString hashes (see below).

I don't know if after the explanation above you still want JSON to
operate on UnicodeStrings only.

> Strictly and semantically the JSON implementation should only operate on UnicodeStrings
> as JSON is only parseable in Unicode. (I wonder what happens with the current JSON reader
> when it encounters a utf-16 encoded String, as far as my test went, it just didn't
> work because it doesn't expect multibyte encodings in String).

JSON is not supposed to include non-Latin-1 characters.  Everything
that's not 7-bit encodable should be escaped using \uXXXX.

> What puzzles me is the question what JSONReader>>#nextJSONString should
> return. Should it be a String or a UnicodeString?

Strictly speaking it should return a UnicodeString, but it's easier to
use, and faster, if (when it's possible) we let it return a String.
Switching to UnicodeStrings as soon as we find a \uXXXX is a
conservative approximation of "when it's possible".

Probably, what is missing from GNU Smalltalk's Iconv package is an
"Encoding" object that can answer queries like "is this string pure
ASCII?", the default very slow implementation being something like this:

     str := self asString.
     uniStr := self asUnicodeString.
     str size = uniStr size ifFalse: [ ^false ].
     str with: uniStr do: [ :ch :uni |
         ch value = uni codePoint ifFalse: [ ^false ] ].
     ^true

This snippet would provide a more rigorous definition of "when it's
possible".

> If it returns UnicodeString no literal string access on a Dictionary returned by
> the JSON parser will work as it would get only a String object which has a different
> hash function than UnicodeString.

Hmmm, this has to be fixed.
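
A minimal sketch for observing the problem, assuming #asUnicodeString is
available in your image (it may require loading the Iconv package).  The
results depend on the GNU Smalltalk version, so no expected output is
shown; the variable name is made up.

```smalltalk
"Sketch: look up a UnicodeString key with a String literal."
| dict |
dict := Dictionary new.
dict at: 'name' asUnicodeString put: 42.

"Do the two classes agree on the hash of the same characters?"
('name' hash = 'name' asUnicodeString hash) printNl.

"If they do not, a lookup with a plain String literal can miss the key."
(dict at: 'name' ifAbsent: [ nil ]) printNl.
```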

Paolo



_______________________________________________
help-smalltalk mailing list
help-smalltalk@gnu.org

http://lists.gnu.org/mailman/listinfo/help-smalltalk
