Sungjin Chun wrote:
> Hi,
>
> For me, string should not be limited to collection of
single byte
> characters. String is string not a simple collection of
byte, isn't it? I
> think squeak's approach (or OpenStep's approach,
where abstract public
> string class and concrete private subclasses of string
that implements
> several cases of string). But I'm not currently
working hard on GNU
> Smalltalk, this may not be the best idea for GNU
Smalltalk's case
>
There's already CharacterArray as a superclass of String.
It probably
would not be hard to have a UnicodeString subclass of
CharacterArray,
and use that instead of WordArray inside the I18N package.
I'd also
need UnicodeCharacter, probably.
I'm working on it in my spare time, I attach my current
prototype patch.
Paolo
--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
 -89,70
+89,70 
Namespace current: Smalltalk.I18N.Encoders!
-Encoder subclass: #FromUCS4
+Encoder subclass: #FromUTF32
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!
-FromUCS4 comment:
-'This class is a superclass for classes that convert from
UCS4
+FromUTF32 comment:
+'This class is a superclass for classes that convert from
UTF-32
characters (encoded as 32-bit Integers) to bytes in another
encoding (encoded as Characters).'!
-Encoder subclass: #ToUCS4
+Encoder subclass: #ToUTF32
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!
-ToUCS4 comment:
+ToUTF32 comment:
'This class is a superclass for classes that convert from
bytes
-(encoded as Characters) to UCS4 characters (encoded as
32-bit
+(encoded as Characters) to UTF-32 characters (encoded as
32-bit
Integers to simplify the code and to avoid endianness
conversions).'!
-ToUCS4 subclass: #ComposeUCS4LE
+ToUTF32 subclass: #ComposeUTF32LE
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!
-ComposeUCS4LE comment:
-'This class is used internally to provide UCS4 characters
encoded as
-32-bit integers for a descendent of FromUCS4, when the
starting
+ComposeUTF32LE comment:
+'This class is used internally to provide UTF-32
characters encoded as
+32-bit integers for a descendent of FromUTF32, when the
starting
encoding is little-endian.'!
-ToUCS4 subclass: #ComposeUCS4BE
+ToUTF32 subclass: #ComposeUTF32BE
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!
-ComposeUCS4BE comment:
-'This class is used internally to provide UCS4 characters
encoded as
-32-bit integers for a descendent of FromUCS4, when the
starting
+ComposeUTF32BE comment:
+'This class is used internally to provide UTF-32
characters encoded as
+32-bit integers for a descendent of FromUTF32, when the
starting
encoding is big-endian.'!
-FromUCS4 subclass: #SplitUCS4LE
+FromUTF32 subclass: #SplitUTF32LE
instanceVariableNames: 'wch'
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!
-SplitUCS4LE comment:
+SplitUTF32LE comment:
'This class is used internally to split into four 8-bit
characters
-the 32-bit UCS4 integers coming from a descendent of
ToUCS4, when
+the 32-bit UTF-32 integers coming from a descendent of
ToUTF32, when
the destination encoding is little-endian.'!
-FromUCS4 subclass: #SplitUCS4BE
+FromUTF32 subclass: #SplitUTF32BE
instanceVariableNames: 'count wch'
classVariableNames: ''
poolDictionaries: ''
category: 'i18n-Character sets'!
-SplitUCS4BE comment:
+SplitUTF32BE comment:
'This class is used internally to split into four 8-bit
characters
-the 32-bit UCS4 integers coming from a descendent of
ToUCS4, when
+the 32-bit UTF-32 integers coming from a descendent of
ToUTF32, when
the destination encoding is big-endian.'!
Encoder subclass: #Iconv
 -166,21
+166,21 
'This class is used to delegate the actual character set
conversion
to the C library''s iconv function. Most conversions use
iconv as
the only step in the conversions, sometimes the structure
is
-ToUCS4+SplitUCS4xx+Iconv or Iconv+ComposeUCS4xx+FromUCS4,
rarely
+ToUTF32+SplitUTF32xx+Iconv or
Iconv+ComposeUTF32xx+FromUTF32, rarely
Iconv is skipped altogether and only Smalltalk converters
are used.'!
-FromUCS4 subclass: #ToUTF7
+FromUTF32 subclass: #ToUTF7
instanceVariableNames: 'left value lookahead'
classVariableNames: 'Base64Characters DirectCharacters
ToBase64'
poolDictionaries: ''
category: 'i18n-Encodings'!
ToUTF7 comment:
-'This class implements a converter that transliterates
UCS4
+'This class implements a converter that transliterates
UTF-32
characters (encoded as 32-bit Integers) to UTF-7 encoded
characters.'!
-ToUCS4 subclass: #FromUTF7
+ToUTF32 subclass: #FromUTF7
instanceVariableNames: 'shift wch lookahead'
classVariableNames: 'DirectCharacters FromBase64'
poolDictionaries: ''
 -188,7
+188,7 
ToUTF7 comment:
'This class implements a converter that transliterates
UTF-7
-encoded characters to UCS4 values (encoded as 32-bit
Integers).'!
+encoded characters to UTF-32 values (encoded as 32-bit
Integers).'!
Namespace current: Smalltalk.I18N!
 -241,9
+241,9 
!Encoder methodsFor: 'private - initialization'!
initializeFrom: fromEncoding to: toEncoding origin:
aStringOrStream
- origin := aStringOrStream isString
- ifTrue: [ aStringOrStream readStream ]
- ifFalse: [ aStringOrStream ].
+ origin := (aStringOrStream isKindOf: Stream)
+ ifFalse: [ aStringOrStream readStream ]
+ ifTrue: [ aStringOrStream ].
self flush
! !
 -258,27
+258,27 
}
!
-registerEncoderFor: arrayOfAliases toUCS4: toUCS4Class
fromUCS4: fromUCS4Class
+registerEncoderFor: arrayOfAliases toUTF32: toUTF32Class
fromUTF32: fromUTF32Class
"Register the two classes that will respectively
convert from the
- charsets in arrayOfAliases to UCS4 and vice versa.
+ charsets in arrayOfAliases to UTF-32 and vice versa.
The former class is a stream that accepts characters
and returns
- (via #next) integers representing UCS-4 character
codes, while
- the latter accepts UCS-4 character codes and converts
them to
+ (via #next) integers representing UTF-32 character
codes, while
+ the latter accepts UTF-32 character codes and converts
them to
characters. For an example see respectively FromUTF7
and ToUTF7
(I admit it is not a trivial example)."
EncodersRegistry := EncodersRegistry copyWith:
- { arrayOfAliases. toUCS4Class. fromUCS4Class }
+ { arrayOfAliases. toUTF32Class. fromUTF32Class }
! !
!EncodedStream class methodsFor: 'private -
triangulating'!
bigEndianPivot
"When only one of the sides is implemented in
Smalltalk
- and the other is obtained via iconv, we use UCS-4 to
+ and the other is obtained via iconv, we use UTF-32 to
marshal data from Smalltalk to iconv; answer whether
we
- should encode UCS-4 characters as big-endian."
+ should encode UTF-32 characters as big-endian."
^Memory bigEndian
!
 -287,29
+287,119 
and the other is obtained via iconv, we need a common
pivot encoding to marshal data from Smalltalk to
iconv.
Answer the iconv name of this encoding."
- ^self bigEndianPivot ifTrue: [ 'UCS-4BE' ] ifFalse: [
'UCS-4LE' ]
+ ^self bigEndianPivot ifTrue: [ 'UTF-32BE' ] ifFalse:
[ 'UTF-32LE' ]
!
-split: input
+split: input to: encoding
"Answer a pipe with the given input stream (which
produces
- UCS-4 character codes as integers) and whose output is
+ UTF-32 character codes as integers) and whose output
is
a series of Characters in the required pivot
encoding"
- ^self bigEndianPivot
- ifTrue: [ SplitUCS4BE on: input from: 'words' to:
'UCS4-BE' ]
- ifFalse: [ SplitUCS4LE on: input from: 'words' to:
'UCS4-LE' ].
+ ^(encoding = 'UCS-4BE' or: [ encoding = 'UTF-32BE'
])
+ ifTrue: [ SplitUTF32BE on: input from: 'UTF-32' to:
encoding ]
+ ifFalse: [ SplitUTF32LE on: input from: 'UTF-32' to:
encoding ].
!
-compose: input
+compose: input from: encoding
"Answer a pipe with the given input stream (which
produces
Characters in the required pivot encoding) and whose
output
- is a series of integer UCS-4 character codes."
- ^self bigEndianPivot
- ifTrue: [ ComposeUCS4BE on: input from: 'UCS4-BE' to:
'words' ]
- ifFalse: [ ComposeUCS4LE on: input from: 'UCS4-LE' to:
'words' ].
+ is a series of integer UTF-32 character codes."
+ ^(encoding = 'UCS-4BE' or: [ encoding = 'UTF-32BE'
])
+ ifTrue: [ ComposeUTF32BE on: input from: encoding to:
'UTF-32' ]
+ ifFalse: [ ComposeUTF32LE on: input from: encoding to:
'UTF-32' ].
! !
!EncodedStream class methodsFor: 'instance creation'!
+encoding: aWordArray
+ "Answer a pipe of encoders that converts
aWordArray (which contains
+ Integers for the Unicode values) to the current
locale's default
+ charset."
+ ^self
+ encoding: aWordArray
+ as: Locale default charset
+!
+
+encoding: aStringOrStream as: toEncoding
+ "Answer a pipe of encoders that converts
aWordArray (which contains
+ Integers for the Unicode values) to the supplied
encoding (which
+ can be an ASCII String or Symbol)."
+ | pivot to encoderTo pipe |
+
+ "Adopt an uniform naming"
+ to := toEncoding asString.
+ (from = 'UTF-32' or: [ from = 'UCS-4' ])
+ ifTrue: [ to := self pivotEncoding ].
+ (to = 'UTF-16' or: [ to = 'UCS-2' ])
+ ifTrue: [ to := self pivotEncoding copyReplacing: '32'
with: '16' ].
+
+ "If converting to the pivot encoding, we're
done."
+ pivot := 'UTF-32'.
+ ((to startsWith: 'UCS-4') or: [ to startsWith:
'UTF-32' ])
+ ifTrue: [ pivot := to ].
+ pivot = 'UTF-32' ifTrue: [ pivot := self
pivotEncoding ].
+
+ encoderTo := Iconv.
+ EncodersRegistry do: [ :each |
+ ((each at: 1) includes: to)
+ ifTrue: [ encoderTo := each at: 2 ]
+ ].
+
+ pipe := aStringOrStream.
+
+ "Split UTF-32 character codes into bytes if
needed by iconv."
+ encoderTo == Iconv ifTrue: [ pipe := self split: pipe
to: pivot ].
+
+ "If not converting to the pivot encoding, we need
one more step."
+ to = pivot ifFalse: [
+ pipe := encoderTo on: aStringOrStream from: pivot
to: toEncoding ].
+ ^pipe
+!
+
+unicodeOn: aStringOrStream
+ "Answer a pipe of encoders that converts
aStringOrStream (which can
+ be a string or another stream) from the current
locale's default
+ charset to integers representing Unicode character
codes."
+ ^self
+ unicodeOn: aStringOrStream
+ encoding: Locale default charset
+!
+
+unicodeOn: aStringOrStream encoding: fromEncoding
+ "Answer a pipe of encoders that converts
aStringOrStream
+ (which can be a string or another stream) from the
supplied
+ encoding (which can be an ASCII String or Symbol) to
+ integers representing Unicode character codes."
+ | from pivot encoderFrom pipe |
+
+ "Adopt an uniform naming"
+ from := fromEncoding asString.
+ (from = 'UTF-32' or: [ from = 'UCS-4' ])
+ ifTrue: [ from := aStringOrStream utf32Encoding ].
+ (from = 'UTF-16' or: [ from = 'UCS-2' ])
+ ifTrue: [ from := aStringOrStream utf16Encoding ].
+
+ pivot := 'UTF-32'.
+ ((from startsWith: 'UCS-4') or: [ from startsWith:
'UTF-32' ])
+ ifTrue: [ pivot := from ].
+ pivot = 'UTF-32' ifTrue: [ pivot := self
pivotEncoding ].
+
+ encoderFrom := Iconv.
+ EncodersRegistry do: [ :each |
+ ((each at: 1) includes: from)
+ ifTrue: [ encoderFrom := each at: 2 ]
+ ].
+
+ pipe := aStringOrStream.
+
+ "If not converting from the pivot encoding, we
need one more step."
+ from = pivot ifFalse: [
+ pipe := encoderFrom on: aStringOrStream from:
fromEncoding to: pivot ].
+
+ "Compose iconv-produced bytes into UTF-32
character codes if needed."
+ encoderFrom == Iconv ifTrue: [ pipe := self compose:
pipe from: pivot ].
+ ^pipe
+!
+
on: aStringOrStream from: fromEncoding
"Answer a pipe of encoders that converts
aStringOrStream
(which can be a string or another stream) from the
given
 -340,8
+430,20 
"Adopt an uniform naming"
from := fromEncoding asString.
to := toEncoding asString.
- from = 'UCS-4' ifTrue: [ from := 'UCS-4BE' ].
- to = 'UCS-4' ifTrue: [ to := 'UCS-4BE' ].
+ (from = 'UTF-32' or: [ from = 'UCS-4' ])
+ ifTrue: [ from := aStringOrStream utf32Encoding ].
+ (from = 'UTF-16' or: [ from = 'UCS-2' ])
+ ifTrue: [ from := aStringOrStream utf16Encoding ].
+ (to = 'UTF-32' or: [ to = 'UCS-4' ])
+ ifTrue: [ to := self pivotEncoding ].
+ (to = 'UTF-16' or: [ to = 'UCS-2' ])
+ ifTrue: [ to := self pivotEncoding copyReplaceAll: '32'
with: '16' ].
+
+ ((from startsWith: 'UCS-4') or: [ from startsWith:
'UTF-32' ])
+ ifTrue: [ pivot := from ].
+ ((to startsWith: 'UCS-4') or: [ to startsWith:
'UTF-32' ])
+ ifTrue: [ pivot := to ].
+ pivot = 'UTF-32' ifTrue: [ pivot := self
pivotEncoding ].
encoderFrom := encoderTo := Iconv.
EncodersRegistry do: [ :each |
 -358,12
+460,12 
"Else answer a `pipe' that takes care of
triangulating.
There is an additional complication: Smalltalk
encoders
read or provide a stream of character codes
(respectively
- if the source is UCS-4, or the target is UCS-4), while
iconv
+ if the source is UTF-32, or the target is UTF-32),
while iconv
expects raw bytes. So we add an intermediate layer if
a mixed Smalltalk+iconv conversion is done: it
converts
- character codes --> bytes (SplitUCS4xx, used if
iconv will
- convert from UCS-4) or bytes --> character code
(ComposeUCS4xx,
- used if iconv will convert to UCS-4).
+ character codes --> bytes (SplitUTF32xx, used if
iconv will
+ convert from UTF-32) or bytes --> character code
(ComposeUTF32xx,
+ used if iconv will convert to UTF-32).
There are five different cases (remember that at least
one converter
is not iconv, so `both use iconv' and `from = pivot =
to' are banned):
 -373,7
+475,6 
from uses iconv --> iconv + Compose + non-iconv
(implies to ~= pivot)
none uses iconv --> non-iconv + non-iconv (implies
neither = pivot)"
- pivot := self pivotEncoding.
pipe := aStringOrStream.
from = pivot
ifFalse: [
 -382,16
+483,16 
pipe := encoderFrom on: pipe from: fromEncoding to:
pivot.
encoderTo == Iconv ifTrue: [
- pipe := self split: pipe.
+ pipe := self split: pipe to: pivot.
"Check if we already reached the destination
format."
to = pivot ifTrue: [ ^pipe ].
].
].
- "Compose iconv-produced bytes into UCS-4
character codes if needed."
+ "Compose iconv-produced bytes into UTF-32
character codes if needed."
encoderFrom == Iconv ifTrue: [
- pipe := self compose: pipe
+ pipe := self compose: pipe from: pivot
].
^encoderTo on: pipe from: pivot to: toEncoding.
 -399,7
+500,7 
Namespace current: Smalltalk.I18N.Encoders!
-!FromUCS4 methodsFor: 'stream operation'!
+!FromUTF32 methodsFor: 'stream operation'!
species
"We answer a string of Characters encoded in our
destination
 -407,15
+508,15 
^String
! !
-!ToUCS4 methodsFor: 'stream operation'!
+!ToUTF32 methodsFor: 'stream operation'!
species
- "We answer a WordArray of UCS4 characters encoded
as a series of
+ "We answer a WordArray of UTF-32 characters
encoded as a series of
32-bit Integers."
^WordArray
! !
-!ComposeUCS4LE methodsFor: 'stream operation'!
+!ComposeUTF32LE methodsFor: 'stream operation'!
next
"Answer a 32-bit integer obtained by reading four
8-bit character
 -426,7
+527,7 
(self nextInput asInteger bitShift: 24)
! !
-!ComposeUCS4BE methodsFor: 'stream operation'!
+!ComposeUTF32BE methodsFor: 'stream operation'!
next
"Answer a 32-bit integer obtained by reading four
8-bit character
 -439,7
+540,7 
self nextInput asInteger
! !
-!SplitUCS4LE methodsFor: 'stream operation'!
+!SplitUTF32LE methodsFor: 'stream operation'!
atEnd
"Answer whether the receiver can produce more
characters"
 -474,7
+575,7 
wch := 1
! !
-!SplitUCS4BE methodsFor: 'stream operation'!
+!SplitUTF32BE methodsFor: 'stream operation'!
atEnd
"Answer whether the receiver can produce more
characters"
 -670,7
+771,7 
!ToUTF7 class methodsFor: 'initialization'!
initialize
- "Initialize the tables used by the UCS4-to-UTF7
converter"
+ "Initialize the tables used by the
UTF-32-to-UTF-7 converter"
Base64Characters := #[
16r00 16r00 16r00 16r00 16r00 16rA8 16rFF 16r03
 -806,7
+907,7 
!FromUTF7 class methodsFor: 'initialization'!
initialize
- "Initialize the tables used by the UTF7-to-UCS4
converter"
+ "Initialize the tables used by the
UTF-7-to-UTF-32 converter"
FromBase64 := #[
62 99 99 99 63
 -842,7
+943,7 
!FromUTF7 methodsFor: 'converting'!
atEnd
- "Answer whether the receiver can produce another
UCS4 32-bit
+ "Answer whether the receiver can produce another
UTF-32 32-bit
encoded integer"
^lookahead isNil
!
 -878,7
+979,7 
"Flush any remaining state left in the encoder by
the last character
(this is because UTF-7 encodes 6 bits at a time, so it
takes three
characters before it can provide a single 16-bit
character and
- up to six characters before it can provide a full
UCS-4 character)."
+ up to six characters before it can provide a full
UTF-32 character)."
shift := 0.
lookahead := self getNext.
! !
 -977,16
+1078,42 
description
"Answer a textual description of the
exception."
^'unknown encoding specified'! !
-
+
"Now add some extensions to the system classes"
-(CharacterArray classPool includesKey: #DefaultEncoding)
- ifFalse: [ CharacterArray addClassVarName:
#DefaultEncoding ]!
-
!CharacterArray class methodsFor: 'converting'!
defaultEncoding
+ self subclassResponsibility!
+
+!CharacterArray methodsFor: 'converting'!
+
+encoding
+ "Answer the encoding of the receiver, assuming it
is in the
+ default locale's default charset"
+
+ self class defaultEncoding asString = 'UTF-16'
+ ifTrue: [ ^self utf16Encoding ].
+ self class defaultEncoding asString = 'UTF-32'
+ ifTrue: [ ^self utf32Encoding ].
+ ^self class defaultEncoding!
+
+utf16Encoding
+ "Answer the encoding of the receiver, assuming
it's UTF-16"
+ ^Memory bigEndian ifTrue: [ 'UTF-16BE' ] ifFalse: [
'UTF-16LE' ]!
+
+utf32Encoding
+ "Answer the encoding of the receiver, assuming
it's UTF-32"
+ ^Memory bigEndian ifTrue: [ 'UTF-32BE' ] ifFalse: [
'UTF-32LE' ]! !
+
+
+(String classPool includesKey: #DefaultEncoding)
+ ifFalse: [ String addClassVarName: #DefaultEncoding ]!
+
+!String class methodsFor: 'converting'!
+
+defaultEncoding
"Answer the default locale's default
charset"
DefaultEncoding isNil
ifTrue: [ DefaultEncoding := Locale default charset ].
 -999,15
+1126,28 
DefaultEncoding := aString
! !
-!CharacterArray methodsFor: 'converting'!
+!String methodsFor: 'converting'!
-encoding
- "Answer the encoding of the receiver, assuming it
is in the
- default locale's default charset"
+asUnicode
+ "Return a WordArray with the contents of the
receiver, interpreted
+ as the default locale character set."
+ ^(EncodedStream unicodeOn: self) contents!
+
+asUnicode: aString
+ "Return a WordArray with the contents of the
receiver, interpreted
+ as the default locale character set."
+ ^(EncodedStream unicodeOn: self encoding: aString)
contents! !
+
+!String methodsFor: 'converting'!
+
+utf32Encoding
+ "Assuming the receiver is encoded as UTF-16 with
a proper
+ endianness marker, answer the correct encoding of the
receiver."
- ^self class defaultEncoding asString = 'UTF-16'
- ifTrue: [ self utf16Encoding ]
- ifFalse: [ self class defaultEncoding ]
+ | b1 b2 bigEndian |
+ b1 := self at: 1. "Low byte"
+ bigEndian := b1 = 0.
+ ^bigEndian ifTrue: [ 'UTF-32BE' ] ifFalse: [
'UTF-32LE' ]
!
utf16Encoding
 -1026,12
+1166,53 
^bigEndian ifTrue: [ 'UTF-16BE' ] ifFalse: [
'UTF-16LE' ]
! !
+
+!WordArray class methodsFor: 'converting'!
+
+defaultEncoding
+ ^'UTF-32'! !
+
+!WordArray methodsFor: 'converting'!
+
+utf16Encoding
+ self shouldNotImplement! !
+
+!WordArray methodsFor: 'converting'!
+
+encoded
+ "Return a String with the contents of the
receiver, converted
+ into the default locale character set."
+ ^(EncodedStream encoding: self) contents!
+
+encodedAs: aString
+ "Return a String with the contents of the
receiver, converted
+ into the aString locale character set."
+ ^(EncodedStream encoding: self as: aString) contents!
+
+
!PositionableStream methodsFor: 'converting'!
encoding
"Answer the encoding of the underlying
collection"
- ^collection encoding
-! !
+ ^collection encoding!
+
+utf16Encoding
+ "Answer the encoding of the underlying
collection, assuming it's UTF-16"
+ ^collection utf16Encoding!
+
+utf32Encoding
+ "Answer the encoding of the underlying
collection, assuming it's UTF-32"
+ ^collection utf32Encoding! !
+
+!Stream methodsFor: 'converting'!
+
+utf16Encoding
+ "Answer the encoding of the underlying
collection, assuming it's UTF-16"
+ ^Memory bigEndian ifTrue: [ 'UTF-16BE' ] ifFalse: [
'UTF-16LE' ]!
+
+utf32Encoding
+ "Answer the encoding of the underlying
collection, assuming it's UTF-32"
+ ^Memory bigEndian ifTrue: [ 'UTF-32BE' ] ifFalse: [
'UTF-32LE' ]! !
Encoders.ToUTF7 initialize!
Encoders.FromUTF7 initialize!
_______________________________________________
help-smalltalk mailing list
help-smalltalk gnu.org
http://lists.gnu.org/mailman/listinfo/help-smalltalk
|
> I'm working on it in my spare time, I attach my
current prototype patch.
I have almost completed this, it's only about 400 lines of
new code,
mostly in i18n/Sets.st. I have defined a new UnicodeString
class, and
modified Character to have support for characters whose
Unicode code
point is > 255. For ease of testing and usage, also,
I've defined a
syntax $<279> that allows you to refer to a Character
by its ASCII
value. It's equivalent to "279 asCharacter" --
I could have instead
inlined this at compile-time, but I prefer to have also a
more compact
syntax.
The changes are mostly backwards compatible, but characters
should *not*
be compared with ==, but with = unless you're sure the code
point is <=
255. Similarly, they should *not* be printed with nextPut:,
but with
display:, unless you're sure the code point is <= 127.
What follows is some use cases. This is in a UTF-8 locale
but (subject
to the capabilities of your system's iconv function) it
works as well
for every other locale.
I am not very expert in the *needs* of people using Unicode,
so can you
please confirm that it is (close to) what you need? In
particular, I'd
like feedback on what to do when in transcoding is not
enabled, because
right now the behavior is inconsistent: see the notes
preceded by ***.
Without the I18N package, the behavior is not complete and
you can
store, but not print Unicode characters correctly:
Printing a Unicode character:
st> $<279> printNl!
$<16r0117>
Converting a Unicode character to String:
*** maybe should consider returning '?'
st> $<279> asString printNl!
error: Invalid argument <16r0117>: argument must be
between $<0> and
$<16r00FF>
Converting a Unicode character to a UTF-32 String:
st> ($<279> asUnicodeString) printNl!
'<16r0117>'
Converting a UTF-32 String with a Unicode character to a
byte-encoded
String:
*** maybe should give an error instead
st> $<279> asUnicodeString asString printNl!
'?'
Asking the number of characters to the resulting Strings:
st> $<279> asUnicodeString numberOfCharacters
printNl!
1
st> $<279> asUnicodeString asString
numberOfCharacters printNl!
error: should not be implemented in this class
Converting ByteArrays or Strings to UnicodeStrings:
st> #[196 151] asUnicodeString first printNl!
error: should not be implemented in this class
-----
After loading the I18N package, everything is much better:
Printing a Unicode character:
st> $<279> printNl!
$ė
Converting a Unicode character to String:
st> $<279> asString printNl!
'ė'
Converting a Unicode character to a UTF-32 String, and then
back just by
printing it:
st> ($<279> asUnicodeString) printNl!
'ė'
Converting a UTF-32 String with a Unicode character to a
byte-encoded
String:
st> $<279> asUnicodeString asString printNl!
'ė'
Asking the number of characters to the resulting Strings:
st> $<279> asUnicodeString numberOfCharacters
printNl!
1
st> $<279> asUnicodeString asString
numberOfCharacters printNl!
1
Converting ByteArrays or Strings to UnicodeStrings:
st> #[196 151] asUnicodeString first printNl!
$ė
st> #[196 151] asUnicodeString size printNl!
1
st> #[196 151] asUnicodeString numberOfCharacters
printNl!
1
Paolo
_______________________________________________
help-smalltalk mailing list
help-smalltalk gnu.org
http://lists.gnu.org/mailman/listinfo/help-smalltalk
|