Thread: Bug in EncodedStream?




Bug in EncodedStream?
user name
2006-10-16 07:43:39
Hi,

When I run the following:

(I18N.EncodedStream encoding: (UnicodeString fromString: '전성진')) contents !

gst emits an endless stream of garbage-collection messages and then
crashes with a segmentation fault. The string is UTF-8 encoded Korean
text (9 bytes, 3 characters).

Also, is there a simple example of processing a UTF-8 encoded string?

Thanks in advance.
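
As a quick check of the "9 bytes, 3 characters" figure in the report, here is a minimal sketch. It is written in Python rather than Smalltalk, purely as an illustration; it is not gst code, and the variable names are made up for this example.

```python
# Illustration only: compare the character count and the UTF-8 byte count
# of the string from the bug report.
text = "전성진"

utf8_bytes = text.encode("utf-8")

char_count = len(text)        # number of Unicode code points
byte_count = len(utf8_bytes)  # number of bytes once encoded as UTF-8

# Every code point in U+0800..U+FFFF needs 3 bytes in UTF-8,
# and these Hangul syllables all fall in that range.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]

print(char_count, byte_count, bytes_per_char)
```

The gap between the two counts is why byte-oriented and character-oriented views of the same string have to be kept apart.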


_______________________________________________
help-smalltalk mailing list
help-smalltalk@gnu.org

http://lists.gnu.org/mailman/listinfo/help-smalltalk
Bug in EncodedStream?
user name
2006-10-16 08:21:29
Sungjin Chun wrote:
> Hi,
>
> When I run the following:
>
> (I18N.EncodedStream encoding: (UnicodeString fromString: '전성진')) contents !
>
> gst emits an endless stream of garbage-collection messages and then
> crashes with a segmentation fault.
Yes, it is a stupid bug.  When using the system function iconv, gst has
to split the UnicodeCharacters back into 8-bit Characters, and here it
gets stuck in an infinite loop.  The first character, for example, is
$<16rC804>, and the "C8" byte is created as a UnicodeCharacter rather
than a Character.  This causes a recursive creation of another
I18N.EncodedStream.

The attached patch fixes the bug; thanks for reporting it.

In my testing, I only used Eastern-European characters where all bytes
are < 0x80.
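
To make the byte split concrete, here is a minimal sketch of the same shift-and-mask arithmetic (bitShift: -8, bitAnd: 255) that appears in the attached patch. It is written in Python for illustration only and is not the gst implementation. The function name `split_code_unit` is hypothetical, and U+0151 is simply one Eastern-European character chosen as an example.

```python
def split_code_unit(code_unit):
    """Split a 16-bit value into (high_byte, low_byte) using shift and mask."""
    high_byte = (code_unit >> 8) & 0xFF
    low_byte = code_unit & 0xFF
    return high_byte, low_byte


# First character of the reported string: U+C804.
korean_high, korean_low = split_code_unit(0xC804)

# An example Eastern-European character (U+0151, chosen for illustration).
latin_high, latin_low = split_code_unit(0x0151)

# True when both bytes stay below 0x80, the only case the earlier tests covered.
korean_all_below_80 = korean_high < 0x80 and korean_low < 0x80
latin_all_below_80 = latin_high < 0x80 and latin_low < 0x80

print(hex(korean_high), hex(korean_low), korean_all_below_80)
print(hex(latin_high), hex(latin_low), latin_all_below_80)
```

The Korean character yields a high byte of 0xC8, which is at or above 0x80, while both bytes of the Eastern-European example stay below it.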
> Also, is there a simple example of processing a UTF-8 encoded string?
>
Can you expand?

Paolo
--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
@@ -718,13 +718,13 @@ next
          been extracted."
 	wch := answer := self nextInput codePoint.
 	wch := (wch bitShift: -8) + 16r1000000.
-	^(answer bitAnd: 255) asCharacter
+	^Character value: (answer bitAnd: 255)
     ].
 
     "Answer any other byte"
     answer := wch bitAnd: 255.
     wch := wch bitShift: -8.
-    ^answer asCharacter
+    ^Character value: answer
 !
 
 flush
@@ -754,7 +754,7 @@ next
 	wch := answer := self nextInput codePoint.
 	wch := wch bitAnd: 16rFFFFFF.
 	count := 3.
-	^(answer bitShift: -24) asCharacter
+	^Character value: (answer bitShift: -24)
     ].
 
     "Answer any other byte.  We keep things so that the byte we answer
@@ -763,7 +763,7 @@ next
     wch := wch bitAnd: 16rFFFF.
     wch := wch bitShift: 8.
     count := count - 1.
-    ^answer asCharacter
+    ^Character value: answer
 !
 
 flush
Bug in EncodedStream?
user name
2006-10-16 10:25:49
Paolo Bonzini wrote:

>> Also, is there a simple example of processing a UTF-8 encoded string?
>>
> Can you expand?
>
> Paolo

I mean that I would like example code that shows a good pattern for
dealing with multibyte strings. For example, I am not sure whether
this code is good or not:

str _ UnicodeString fromString: 'Some UTF-8 Encoded String'.

It seems that code like

str _ UnicodeString fromString: 'Some UTF-8 Encoded String' encoding: UTF8StringEncoding.

would be better, because it lets me tell UnicodeString the exact
encoding of the given string or array of bytes.

Thanks.


Re: Bug in EncodedStream?
user name
2006-10-16 11:05:02
> I mean that I would like example code that shows a good pattern for
> dealing with multibyte strings. For example, I am not sure whether
> this code is good or not:
>
> str _ UnicodeString fromString: 'Some UTF-8 Encoded String'.
>
It is if your default encoding is UTF-8, or if the encoded string
includes a byte-order mark (for this, you need the attached patch :-( ...).
For example, this works:

st> #[254 255 200 4 193 49 201 196] asString encoding!
'UTF-16BE'
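
The check behind that answer can be sketched as follows. This is a Python illustration, not the gst implementation. The function name `sniff_bom` is hypothetical, and the order of the checks simply follows the byte comparisons in the attached patch to the `encoding` method.

```python
# Byte-order marks, listed in the order the patch compares them:
# the 4-byte UTF-32 marks first, then UTF-16, then UTF-8.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
]


def sniff_bom(data):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Same bytes as in the transcript: FE FF, then three 16-bit code units.
data = bytes([254, 255, 200, 4, 193, 49, 201, 196])
encoding, bom_len = sniff_bom(data)

# Skip the BOM and decode the rest with the detected encoding.
text = data[bom_len:].decode(encoding)
print(encoding, text)
```

Where this sketch returns None for input without a byte-order mark, the patched method falls back to the default encoding instead.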
> str _ UnicodeString fromString: 'Some UTF-8 Encoded String' encoding: UTF8StringEncoding.
>
In gst, UTF8StringEncoding is written as the string 'UTF-8'.

Paolo

* auto-adding bonzini@gnu.org--2004b/smalltalk--devo--2.2--patch-152 to greedy revision library /Users/bonzinip/Archives/revlib
* found immediate ancestor revision in library (bonzini@gnu.org--2004b/smalltalk--devo--2.2--patch-151)
* patching for this revision (bonzini@gnu.org--2004b/smalltalk--devo--2.2--patch-152)
--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
@@ -1289,21 +1289,21 @@ encoding
      default locale's default charset"
 
     | encoding |
-    (self size >= 4 and: [ (self at: 1) = 0 and: [ (self at: 2) = 0 and: [
-    	(self at: 3) = 254 and: [
-    	(self at: 4) = 255 ]]]]) ifTrue: [ ^'UTF-32BE' ].
-    (self size >= 4 and: [ (self at: 4) = 0 and: [ (self at: 3) = 0 and: [
-    	(self at: 2) = 254 and: [
-    	(self at: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
+    (self size >= 4 and: [ (self valueAt: 1) = 0 and: [ (self valueAt: 2) = 0 and: [
+    	(self valueAt: 3) = 254 and: [
+    	(self valueAt: 4) = 255 ]]]]) ifTrue: [ ^'UTF-32BE' ].
+    (self size >= 4 and: [ (self valueAt: 4) = 0 and: [ (self valueAt: 3) = 0 and: [
+    	(self valueAt: 2) = 254 and: [
+    	(self valueAt: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
     (self size >= 2 and: [
-    	(self at: 1) = 254 and: [
-    	(self at: 2) = 255 ]]) ifTrue: [ ^'UTF-16BE' ].
+    	(self valueAt: 1) = 254 and: [
+    	(self valueAt: 2) = 255 ]]) ifTrue: [ ^'UTF-16BE' ].
     (self size >= 2 and: [
-    	(self at: 2) = 254 and: [
-    	(self at: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
-    (self size >= 3 and: [ (self at: 1) = 16rEF and: [
-    	(self at: 2) = 16rBB and: [
-    	(self at: 3) = 16rBF ]]]) ifTrue: [ ^'UTF-8' ].
+    	(self valueAt: 2) = 254 and: [
+    	(self valueAt: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
+    (self size >= 3 and: [ (self valueAt: 1) = 16rEF and: [
+    	(self valueAt: 2) = 16rBB and: [
+    	(self valueAt: 3) = 16rBF ]]]) ifTrue: [ ^'UTF-8' ].
 
     encoding := self class defaultEncoding.
     encoding asString = 'UTF-16' ifTrue: [ ^self utf16Encoding ].
@@ -1314,9 +1314,9 @@ utf32Encoding
     "Assuming the receiver is encoded as UTF-16 with a proper
      endianness marker, answer the correct encoding of the receiver."
 
-    (self size >= 4 and: [ (self at: 4) = 0 and: [ (self at: 3) = 0 and: [
-    	(self at: 2) = 254 and: [
-    	(self at: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
+    (self size >= 4 and: [ (self valueAt: 4) = 0 and: [ (self valueAt: 3) = 0 and: [
+    	(self valueAt: 2) = 254 and: [
+    	(self valueAt: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
     ^'UTF-32BE'
 !
 
@@ -1325,8 +1325,8 @@ utf16Encoding
      endianness marker, answer the correct encoding of the receiver."
 
     (self size >= 2 and: [
-    	(self at: 2) = 254 and: [
-    	(self at: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
+    	(self valueAt: 2) = 254 and: [
+    	(self valueAt: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
     ^'UTF-16BE'
 ! !
 