
Thread: REXML fails to parse UTF-16 XML.




REXML fails to parse UTF-16 XML.
user name
2006-09-10 16:25:58
Bugs item #5711, was opened at 2006-09-11 01:25
You can respond by visiting: 
http://rubyforge.org/tracker/?func=detail&atid=1698&aid=5711&group_id=426

Category: Standard Library
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Masahiro Sakai (sakai)
Assigned to: Nobody (None)
Summary: REXML fails to parse UTF-16 XML.

Initial Comment:
REXML fails to parse some XML documents written in UTF-16.

% cat test-rexml.rb
require 'rexml/document'
#s = "\xfe\xff" + Iconv.conv("utf-16be", "us-ascii", '<?xml version="1.0" encoding="utf-16"?><a />')
s = "\376\377\000<\000?\000x\000m\000l\000 \000v\000e\000r\000s\000i\000o\000n\000=\000\"\0001\000.\0000\000\"\000 \000e\000n\000c\000o\000d\000i\000n\000g\000=\000\"\000u\000t\000f\000-\0001\0006\000\"\000?\000>\000<\000a\000 \000/\000>"
REXML::Document.new(s)

% ruby-19 -v test-rexml.rb
ruby 1.9.0 (2006-09-10) [i686-linux]
/usr/local/lib/ruby/1.9/rexml/parsers/treeparser.rb:89:in `REXML::Parsers::TreeParser#parse': #<Iconv::InvalidCharacter: "\346\204\274\342\274\240", [">"]> (REXML::ParseException)
/usr/local/lib/ruby/1.9/rexml/encodings/ICONV.rb:7:in `Iconv#conv'
/usr/local/lib/ruby/1.9/rexml/encodings/ICONV.rb:7:in `decode_iconv'
/usr/local/lib/ruby/1.9/rexml/source.rb:50:in `REXML::Source#encoding='
/usr/local/lib/ruby/1.9/rexml/parsers/baseparser.rb:210:in `REXML::Parsers::BaseParser#pull'
/usr/local/lib/ruby/1.9/rexml/parsers/treeparser.rb:21:in `REXML::Parsers::TreeParser#parse'
/usr/local/lib/ruby/1.9/rexml/document.rb:190:in `build'
/usr/local/lib/ruby/1.9/rexml/document.rb:45:in `initialize'
test-rexml.rb:4:in `Class#new'
test-rexml.rb:4
...
">"
Line: 
Position: 
Last 80 unconsumed characters:
<a />	from /usr/local/lib/ruby/1.9/rexml/document.rb:190:in `build'
	from /usr/local/lib/ruby/1.9/rexml/document.rb:45:in `initialize'
	from test-rexml.rb:4:in `Class#new'
	from test-rexml.rb:4
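
The commented-out Iconv line in the script shows how the test string was produced. Iconv was removed from Ruby's standard library in 2.0, so on a current Ruby the same bytes can be built with String#encode instead. This is only a sketch of that substitution; the variable names are illustrative and not part of the report.

```ruby
# Sketch: build the UTF-16BE test input with a byte order mark,
# using String#encode in place of the Iconv call from the report.
xml = '<?xml version="1.0" encoding="utf-16"?><a />'

bom = "\xFE\xFF".b                     # big-endian byte order mark, as raw bytes
s   = bom + xml.encode("UTF-16BE").b   # BOM followed by the UTF-16BE payload

# The leading bytes correspond to the octal-escaped literal in the
# report: \376\377 \000< \000? \000x ...
p s.bytes.first(8)
```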



REXML fails to parse UTF-16 XML.
user name
2006-09-11 02:45:29
Hi,

In message "Re: [ ruby-Bugs-5711 ] REXML fails to parse UTF-16 XML."
    on Mon, 11 Sep 2006 01:25:58 +0900, <noreply@rubyforge.org> writes:

|REXML fails to parse some XML documents written in UTF-16.

REXML is converting the body twice, once from initialize, once more
from XMLDECL_START.  I made a patch.  If Sean Russell accepts it, it
would be merged into 1.8.

Changes:

  * Encoding#encoding= to return a boolean value telling whether the
    body was really converted or not.
  * Specific conversion libraries (e.g. rexml/encodings/UTF-16.rb) to
    have higher precedence.
  * UTF-16#decode_utf16 should work on strings without a BOM.

							matz.
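
The double conversion described above can be reproduced outside REXML. The following is a sketch only: it uses String#encode rather than Iconv, and it assumes the second pass read the text as little-endian UTF-16, which the report does not state but which matches its output. Decoding the already-converted `<a />` a second time yields the six bytes quoted in the Iconv::InvalidCharacter message, with `>` left over.

```ruby
# After the first conversion, the remaining body is plain "<a />".
already_decoded = "<a />".b

# Second, mistaken pass: treat those bytes as UTF-16 again.
# Assumption: little-endian, chosen because it matches the report.
pairs    = already_decoded.byteslice(0, 4)   # "<a /" forms two 16-bit units
leftover = already_decoded.byteslice(4, 1)   # ">" has no partner byte

garbled = pairs.force_encoding("UTF-16LE").encode("UTF-8")

p garbled.bytes.map { |b| format("\\%03o", b) }.join   # octal escapes, as in the report
p leftover
```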

--- lib/rexml/encoding.rb	22 Aug 2006 15:25:43 -0000	1.10
+++ lib/rexml/encoding.rb	11 Sep 2006 02:36:44 -0000
@@ -26,17 +26,18 @@ module REXML
         $VERBOSE = false
-        return if defined? @encoding and enc == @encoding
+				enc = enc.nil? ? nil : enc.upcase
+        return false if defined? @encoding and enc == @encoding
         if enc and enc != UTF_8
-          @encoding = enc.upcase
-          begin
-            require 'rexml/encodings/ICONV.rb'
-            Encoding.apply(self, "ICONV")
-          rescue LoadError, Exception => err
-            raise ArgumentError, "Bad encoding name #@encoding" unless @encoding =~ /^[\w-]+$/
-            @encoding.untaint 
-            enc_file = File.join( "rexml", "encodings", "#@encoding.rb" )
-            begin
-              require enc_file
-              Encoding.apply(self, @encoding)
-            rescue LoadError
-              puts $!.message
+					@encoding = enc
+					raise ArgumentError, "Bad encoding name #@encoding" unless @encoding =~ /^[\w-]+$/
+					@encoding.untaint 
+					enc_file = File.join( "rexml", "encodings", "#@encoding.rb" )
+					begin
+						require enc_file
+						Encoding.apply(self, @encoding)
+          rescue LoadError, Exception
+						begin
+							require 'rexml/encodings/ICONV.rb'
+							Encoding.apply(self, "ICONV")
+            rescue LoadError => err
+              puts err.message
               raise ArgumentError, "No decoder found for encoding #@encoding.  Please install iconv."
@@ -52,2 +53,3 @@ module REXML
       end
+			true
     end
Index: lib/rexml/source.rb
===================================================================
RCS file: /var/cvs/src/ruby/lib/rexml/source.rb,v
retrieving revision 1.9
diff -p -u -1 -r1.9 source.rb
--- lib/rexml/source.rb	22 Aug 2006 15:25:43 -0000	1.9
+++ lib/rexml/source.rb	11 Sep 2006 02:36:44 -0000
@@ -46,3 +46,3 @@ module REXML
 		def encoding=(enc)
-			super
+			return unless super
 			@line_break = encode( '>' )
Index: lib/rexml/encodings/UTF-16.rb
===================================================================
RCS file: /var/cvs/src/ruby/lib/rexml/encodings/UTF-16.rb,v
retrieving revision 1.5
diff -p -u -1 -r1.5 UTF-16.rb
--- lib/rexml/encodings/UTF-16.rb	9 Apr 2005 17:03:32 -0000	1.5
+++ lib/rexml/encodings/UTF-16.rb	11 Sep 2006 02:36:44 -0000
@@ -18,5 +18,6 @@ module REXML
     def decode_utf16(str)
+      str = str[2..-1] if /^\376\377/ =~ str
       array_enc=str.unpack('C*')
       array_utf8 = []
-      2.step(array_enc.size-1, 2){|i| 
+      0.step(array_enc.size-1, 2){|i| 
         array_utf8 << (array_enc.at(i+1) +
array_enc.at(i)*0x100)
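
The idea behind this last hunk, dropping a BOM only when one is present and then decoding from the first byte, can be sketched as a standalone method. `decode_utf16be_sketch` is a hypothetical name, not REXML's code, and like the patched loop it treats every 16-bit unit as one code point, so surrogate pairs are not handled. It checks with `start_with?` rather than the patch's regular expression so that only a leading BOM is considered.

```ruby
# Hypothetical standalone sketch; big-endian input is assumed.
def decode_utf16be_sketch(str)
  bytes = str.b
  # Drop the byte order mark only if the input really starts with one.
  bytes = bytes.byteslice(2..-1) if bytes.start_with?("\xFE\xFF".b)
  # "n*" reads big-endian 16-bit units; "U*" writes them out as UTF-8.
  bytes.unpack("n*").pack("U*")
end

with_bom    = "\xFE\xFF".b + "<a />".encode("UTF-16BE").b
without_bom = "<a />".encode("UTF-16BE").b

p decode_utf16be_sketch(with_bom)
p decode_utf16be_sketch(without_bom)
```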

REXML fails to parse UTF-16 XML.
user name
2006-09-11 16:01:58
On Sunday 10 September 2006 22:45, Yukihiro Matsumoto wrote:
> REXML is converting the body twice, once from initialize, once more
> from XMLDECL_START.  I made a patch.  If Sean Russell accepts it, it
> would be merged into 1.8.

Please do.

For the record, this is ticket:63 in REXML's trac:

	http://www.germane-software.com/projects/rexml/ticket/63


and is slated to be fixed in REXML 3.1.6.  After I close that bug out,
I'll merge it into Ruby CVS HEAD.

Thanks.

--- SER
REXML fails to parse UTF-16 XML.
user name
2006-09-11 16:59:30
Hi,

In message "Re: [ ruby-Bugs-5711 ] REXML fails to parse UTF-16 XML."
    on Tue, 12 Sep 2006 01:01:58 +0900, Sean Russell <ser@germane-software.com> writes:

|On Sunday 10 September 2006 22:45, Yukihiro Matsumoto wrote:
|> REXML is converting the body twice, once from initialize, once more
|> from XMLDECL_START.  I made a patch.  If Sean Russell accepts it, it
|> would be merged into 1.8.
|
|Please do.

Shall I merge the patch into 1.8?  Or will you?

							matz.

REXML fails to parse UTF-16 XML.
user name
2006-09-11 19:30:58
On Monday 11 September 2006 12:59, Yukihiro Matsumoto wrote:
...
> |> from XMLDECL_START.  I made a patch.  If Sean Russell accepts it, it
> |> would be merged into 1.8.
> |
> |Please do.
>
> Shall I merge the patch into 1.8?  Or will you?

If I do it, it will not get done until this weekend.  If it needs to go
in sooner, then you should merge it.

Otherwise, it doesn't matter to me.  I always check the status of the
CVS repository and cross merge changes.

--- SER




REXML fails to parse UTF-16 XML.
user name
2006-09-11 19:47:52
Hi,

In message "Re: [ ruby-Bugs-5711 ] REXML fails to parse UTF-16 XML."
    on Tue, 12 Sep 2006 04:30:58 +0900, Sean Russell <ser@germane-software.com> writes:

|> Shall I merge the patch into 1.8?  Or will you?
|
|If I do it, it will not get done until this weekend.  If it needs to go
|in sooner, then you should merge it.
|
|Otherwise, it doesn't matter to me.  I always check the status of the
|CVS repository and cross merge changes.

There's no reason to hurry.  Probably it's better for you to commit.
I'm no REXML expert.

							matz.
