Thread: Need a good workaround for bug 1199 / FCKeditor-JTidy-XML issue




2007-03-05 14:31:08
Hello all,
 
I am working on an OpenCms migration and am running into a known issue regarding FCKeditor and JTidy.  I am hoping someone can help me find a good workaround for my client.  The bug is (sort of) open in Bugzilla as #1199.  Variations appear on the OpenCms forums around the web, but with no long-term solution.
 
Problem:  As I understand it, if I create a custom XML resource that includes an HTML field and edit the HTML using FCKeditor, the editor sends the code to JTidy for cleanup before it is marshaled into the XML and saved in the VFS.  Depending on the resource's content-encoding, HTML entities may be replaced with the "real" characters at several places in the process if the real character exists in the code page.  Sometimes, however, something goes wrong and unescaped null characters are introduced, presumably because part of a multibyte character is written someplace it shouldn't be.  XML does not support embedded nulls, even in a CDATA section, so any later attempt to unmarshal this document results in chaos.
 
The resulting bug is:
 
===============================
Error Unmarshalling xml document failed.
Reason: Error on line 31 of document : An invalid XML character
(Unicode: 0x0) was found in the CDATA section. Nested exception: An
invalid XML character (Unicode: 0x0) was found in the CDATA section.
===============================
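
For illustration, one possible stopgap for the error above is to strip characters that XML 1.0 does not allow (including 0x0) from the HTML before it is marshaled.  This is only a minimal sketch: the class and method names (XmlCharFilter, stripInvalidXmlChars) are hypothetical and are not part of OpenCms, FCKeditor, or JTidy, and where such a filter would be hooked into the save pipeline is an assumption left open here.

```java
// Hypothetical helper, not part of OpenCms or JTidy.
final class XmlCharFilter {
    private XmlCharFilter() {}

    // True if the code point matches the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every code point that XML 1.0
    // forbids (nulls, other control characters, lone surrogates) removed.
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Note that this only hides the symptom.  If the null really is a fragment of a multibyte character, the character itself is still lost, so the underlying encoding problem remains.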
 
As part of the discussion on bug #1199, Alexander Kandzior recommends using US-ASCII as the content encoding on files that might contain problematic characters.  This forces all nonprintables and non-ASCII characters to be escaped as entities, thus avoiding the problem in the FCKeditor-JTidy-XMLDom pipeline.
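
To make the effect of that recommendation concrete, here is a rough sketch of what "escaped as entities" means for non-ASCII characters.  It is not the OpenCms implementation; the class and method names (AsciiEntityEscaper, escapeNonAscii) are hypothetical, and the sketch only covers non-ASCII characters, not nonprintable ones.

```java
// Hypothetical illustration, not OpenCms code.
final class AsciiEntityEscaper {
    private AsciiEntityEscaper() {}

    // Replaces every code point outside the 7-bit ASCII range with a
    // decimal numeric character reference, e.g. the copyright sign
    // (U+00A9) becomes "&#169;". ASCII characters pass through unchanged.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this approach the stored document contains only ASCII bytes, so no multibyte sequence can be split on its way through the pipeline.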
 
I've tried this, and the good news is that it works for a basic "Edit" of an XML document.  Unfortunately, a new problem is introduced if I now select "Edit controlcode" on this document.  Several common characters, such as left/right curly quotes and the copyright symbol, become encoded as their numeric entities, e.g. &#169;.  If I edit the XML sourcecode and save it, very bad things happen to these characters; they become over-encoded as sequences of (e.g.) � characters.  All is still OK until I later attempt a plain "Edit" of these documents using FCKeditor.  They are then displayed as square boxes, indicating a totally unprintable character.  If I save the document, they appear that way in the published web page as well.
 
So to sum up, I can encode the document using UTF-8, but then I may not be able to edit the doc repeatedly and reliably using the OpenCms/FCKeditor (if it includes certain non-ASCII characters in unlucky places).  Alternately, I can encode the document using US-ASCII, but if those unlucky non-ASCII characters appear and were encoded as entities, I cannot later edit the doc using a plain "Edit Sourcecode".  As my site will be maintained by my client moving forward, I would very much like to find a reliable workaround to this problem so that any doc can be edited in either manner at any time.  Otherwise, no matter which encoding I choose, sooner or later someone will edit the document with the "wrong" editor, and I will be receiving a panicked phone call from them.
 
I note that someone on the ML had offered a customized version of JTidy that seemed to sidestep this problem.  Did that prove to work, and if so, would I be able to get a copy of the new library?  Alternately, have any other acceptable workarounds been found?
 
Thanks in advance,
 
Nick Straguzzi
Chief Knowledge Architect
Credo Systems, LLC
nick.straguzzicredosystems.com  |  www.credosystems.com
"An investment in knowledge pays the best interest" - Benjamin Franklin