Hi,
In message "Re: [ ruby-Bugs-5711 ] REXML fails to parse UTF-16 XML."
    on Mon, 11 Sep 2006 01:25:58 +0900, <noreply rubyforge.org> writes:
|REXML fails to parse some XML documents written in UTF-16.
REXML converts the body twice: once from initialize, and once
more from XMLDECL_START. I made a patch. If Sean Russell accepts
it, it will be merged into 1.8.
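To see why a second conversion is harmful, here is a minimal sketch. It uses String#encode from modern Ruby rather than the REXML::Encoding code the patch touches, so treat it as an illustration of the symptom only, not of REXML internals.

```ruby
# Illustration only: String#encode stands in for REXML's decoders.
xml = "<a/>"

utf16 = xml.encode("UTF-16BE")   # bytes as they would come from the file
once  = utf16.encode("UTF-8")    # decoded one time: correct

# Decoding the already-decoded text again reinterprets its bytes
# as UTF-16 code units and yields unrelated characters.
twice = once.dup.force_encoding("UTF-16BE").encode("UTF-8")

puts once == xml    # the single conversion round-trips
puts twice == xml   # the double conversion does not
```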
Changes:
* Encoding#encoding= now returns a boolean value that tells
  whether the body was really converted or not.
* A specific conversion library (e.g. rexml/encodings/UTF-16.rb)
  now takes precedence over iconv.
* UTF-16#decode_utf16 should work on strings without a BOM.
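The first change can be sketched as follows. BaseDecoder and BodyDecoder are hypothetical names made up for this example, not REXML classes; the counter stands in for re-converting the buffered body.

```ruby
# Hypothetical classes for illustration; these are not REXML's own.
class BaseDecoder
  attr_reader :encoding, :conversions

  def initialize
    @conversions = 0
  end

  # Returns true only when the encoding really changed.
  def encoding=(enc)
    enc = enc.nil? ? nil : enc.upcase
    return false if defined?(@encoding) && enc == @encoding
    @encoding = enc
    true
  end
end

class BodyDecoder < BaseDecoder
  def encoding=(enc)
    # `super` yields the parent's real return value, even though
    # `obj.encoding = x` as an expression always evaluates to x.
    return unless super
    @conversions += 1   # stand-in for converting the body
  end
end

d = BodyDecoder.new
d.encoding = "utf-16"   # first call: converts
d.encoding = "UTF-16"   # same encoding after upcasing: skipped
puts d.conversions
```

Note that the boolean is only visible through super (or send), because Ruby ignores a setter's return value in plain assignment syntax.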
matz.
--- lib/rexml/encoding.rb 22 Aug 2006 15:25:43 -0000 1.10
+++ lib/rexml/encoding.rb 11 Sep 2006 02:36:44 -0000
@@ -26,17 +26,18 @@ module REXML
         $VERBOSE = false
-        return if defined? @encoding and enc == @encoding
+        enc = enc.nil? ? nil : enc.upcase
+        return false if defined? @encoding and enc == @encoding
         if enc and enc != UTF_8
-          @encoding = enc.upcase
-          begin
-            require 'rexml/encodings/ICONV.rb'
-            Encoding.apply(self, "ICONV")
-          rescue LoadError, Exception => err
-            raise ArgumentError, "Bad encoding name #@encoding" unless @encoding =~ /^[\w-]+$/
-            @encoding.untaint
-            enc_file = File.join( "rexml", "encodings", "#@encoding.rb" )
-            begin
-              require enc_file
-              Encoding.apply(self, @encoding)
-            rescue LoadError
-              puts $!.message
+          @encoding = enc
+          raise ArgumentError, "Bad encoding name #@encoding" unless @encoding =~ /^[\w-]+$/
+          @encoding.untaint
+          enc_file = File.join( "rexml", "encodings", "#@encoding.rb" )
+          begin
+            require enc_file
+            Encoding.apply(self, @encoding)
+          rescue LoadError, Exception
+            begin
+              require 'rexml/encodings/ICONV.rb'
+              Encoding.apply(self, "ICONV")
+            rescue LoadError => err
+              puts err.message
               raise ArgumentError, "No decoder found for encoding #@encoding. Please install iconv."
@@ -52,2 +53,3 @@ module REXML
       end
+      true
     end
Index: lib/rexml/source.rb
===================================================================
RCS file: /var/cvs/src/ruby/lib/rexml/source.rb,v
retrieving revision 1.9
diff -p -u -1 -r1.9 source.rb
--- lib/rexml/source.rb 22 Aug 2006 15:25:43 -0000 1.9
+++ lib/rexml/source.rb 11 Sep 2006 02:36:44 -0000
@@ -46,3 +46,3 @@ module REXML
     def encoding=(enc)
-      super
+      return unless super
       @line_break = encode( '>' )
Index: lib/rexml/encodings/UTF-16.rb
===================================================================
RCS file: /var/cvs/src/ruby/lib/rexml/encodings/UTF-16.rb,v
retrieving revision 1.5
diff -p -u -1 -r1.5 UTF-16.rb
--- lib/rexml/encodings/UTF-16.rb 9 Apr 2005 17:03:32 -0000 1.5
+++ lib/rexml/encodings/UTF-16.rb 11 Sep 2006 02:36:44 -0000
@@ -18,5 +18,6 @@ module REXML
     def decode_utf16(str)
+      str = str[2..-1] if /^\376\377/ =~ str
       array_enc=str.unpack('C*')
       array_utf8 = []
-      2.step(array_enc.size-1, 2){|i|
+      0.step(array_enc.size-1, 2){|i|
         array_utf8 << (array_enc.at(i+1) + array_enc.at(i)*0x100)
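The idea behind the decode_utf16 change in the last hunk can be tried in isolation with a rough sketch like the one below. decode_utf16be is a hypothetical stand-alone helper, not the REXML method; it assumes big-endian input of even length and characters in the Basic Multilingual Plane only (no surrogate pairs).

```ruby
# Hypothetical stand-alone helper, not REXML's decode_utf16.
# Assumes big-endian input, even length, BMP characters only.
def decode_utf16be(str)
  bytes = str.unpack('C*')
  # Drop the byte order mark only when it is actually present,
  # instead of always skipping the first two bytes.
  bytes = bytes[2..-1] if bytes[0, 2] == [0xFE, 0xFF]
  code_points = bytes.each_slice(2).map { |hi, lo| hi * 0x100 + lo }
  code_points.pack('U*')   # build a UTF-8 string from the code points
end

with_bom    = [0xFE, 0xFF, 0x00, 0x61, 0x00, 0x62].pack('C*')
without_bom = [0x00, 0x61, 0x00, 0x62].pack('C*')

puts decode_utf16be(with_bom)
puts decode_utf16be(without_bom)
```

Before the patch the loop started at offset 2 unconditionally, so input without a BOM lost its first character.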