Thread: New user, evaluating XML libraries




New user, evaluating XML libraries
user name
2006-12-19 17:23:03

Hello to all,

 

I’m a new user of libxml and new to XML in general.  I’ve been asked to evaluate XML libraries, preferably Open Source projects, for some things we want to do with XML in our products.  We provide an archival/retrieval system for medical records and images and we use XML for attaching metadata to the files we store.  We have some front-end UI components that make some use of XML but currently most of the work is done in the transport layer and the backend database components.  Due to the volume of data involved, efficiency and execution speed are a prime concern, though not necessarily an overriding one.  Most of the XML work being done now is with roll-your-own string processing.  Going forward we will need to be more sophisticated and standards-compliant.

 

Of the packages that turned up when I did a search, Xerces and libxml are the leading candidates.  I’ve downloaded, installed, built, and written test code for both and based on my findings, I’m leaning very heavily toward recommending libxml.  The person I report to has a very strong bias toward Xerces in general, and the W3C DOM standard in particular, as the hammer with which to pound all nails, even if the problem isn’t a nail.  I’ve also received feedback from some of the users in the Xerces group and they make some points that I should at least consider.  What I’d like to do is present my reasons for recommending libxml, given the job we need to do as described above, include some of the Xerces users’ comments, and hopefully get your thoughts as well.  I like libxml because:

 

  • It’s fast, about 3x faster than Xerces in some fairly rudimentary tests.
  • It supports XPath (one of our big requirements) on its own; Xerces requires a bolt-on component like Xalan or Pathan to do XPath.
  • Being written in C, it has a much simpler programming interface than Xerces’ C++ object model.  Nothing against C++, it’s my primary language and I like it, but the interface to Xerces is more complicated, perhaps unnecessarily so, than most of the C++ I’ve been exposed to.  To me, a simpler interface translates to better understanding by a wider range of programmers, faster up-and-running time, and potentially better, safer code.
  • It’s better documented.  In addition to the API reference manual, there’s the let-me-walk-you-through-it tutorial, well documented sample code, and many pages of additional information on a variety of topics.  The information presented in all areas is more thorough.  Xerces has the Doxygen-generated ref. man., a Programming Guide (equivalent to the tutorial, but sparse by comparison), and some commented sample code.
  • (I may be mistaken about this, but…) for character encodings libxml uses a standard library (iconv) that is distributed with most versions of Linux and Unix (and has been ported to Win32); Xerces uses its own internal routines (?).
  • In addition to a DOM-like interface and SAX support, libxml has the XMLTextReader interface which I haven’t tried yet, but I’m assuming is a fast efficient way to do simple XML queries.  Xerces has only DOM and SAX.
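
For readers who have not used path queries, here is a minimal sketch of what "XPath on a parsed tree" looks like for the kind of file metadata described above. It uses Python's standard-library ElementTree, which supports only a small XPath subset, and not libxml2 itself. The archive/record/patient element names and the helper function are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata document; every element and attribute name here
# is an assumption made for the example, not a real schema.
METADATA = """\
<archive>
  <record id="r1"><patient>A123</patient><image modality="MR"/></record>
  <record id="r2"><patient>B456</patient><image modality="CT"/></record>
  <record id="r3"><patient>A123</patient><image modality="CT"/></record>
</archive>
"""

def record_ids_for_patient(xml_text, patient_id):
    """Return the ids of all records whose <patient> child has the given text."""
    root = ET.fromstring(xml_text)  # builds the whole tree in memory
    # The predicate [patient='...'] selects records by the text of a child element.
    matches = root.findall(f"./record[patient='{patient_id}']")
    return [rec.get("id") for rec in matches]

print(record_ids_for_patient(METADATA, "A123"))
```

The query is one line, which is the maintainability argument for XPath; the cost is that the whole document has to be parsed into a tree first.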

 

I’ve likened the use of big packages like Xerces for some of the things we need to “using a blowtorch to light a cigarette”.  Here is one response from a Xerces user:

 

“Libxml is a great library with somewhat different goals than Xerces.  I don't think it's explicitly stated on the Web site, but Xerces and other projects that build on it tend to implement W3C standards (DOM, XML Schema), while libxml implements what its maintainer prefers (a unique API, RelaxNG), with a focus on efficiency.  Both approaches are reasonable, and which is appropriate depends on your needs.

In your shoes, if I were certain that lighting a cigarette is all I would ever need to do, I'd probably use libxml.  In my experience, though, XML is useful for so many things that I'd probably want to be prepared to bake, boil, weld, and power fighter jets as well - in a variety of local languages.  I'm a nut for portability, and a DOM interface has the advantage of being similar or identical in a wide range of environments (C++, C#, JavaScript, etc).”

 

What about this?  Is Xerces that much more powerful, as the writer suggests?  Is portability the only advantage to W3C-compliant interfaces like DOM?

 

And then this:

 

“In cases where performance is critical, I think you'd be best off avoiding XPath altogether. (snip) An optimal Xerces SAX parser might well be more efficient than libxml parsing + XPath evaluation.”

 

Finally:

 

“One big difference between Xerces-C++ and Libxml2 is that the latter does not have a functional XML Schema validator. I don't know if it is important to you or not. Also note that much of the speed-up of Libxml2 compared to Xerces-C++ comes from the fact that Xerces-C++ uses 2-byte characters (UTF-16) while Libxml2 uses 1-byte characters (UTF-8). Since most performance tests that I am aware of are done on XML files that are either ASCII or UTF-8, Libxml2 has a natural advantage here. This is also something to consider depending on the type of applications you are planning to build.”

 

I’m unsure of the importance of an XML Schema validator so I can’t comment on this.  I don’t think I agree with the comment about speed vis-à-vis UTF-8/16.  Encoding conversions using UTF-8 are more computationally intensive than UTF-16, so what you lose by moving around double the number of bytes would, I think, be offset by the greater CPU requirement for translating the data.  Does Xerces’ use of UTF-16 provide support for a wider range of encodings and local languages?

 

I know this is rather long and I apologize in advance if it is too much so, but obviously there’s a lot to be considered, this is a hefty decision, and I want to provide anybody who might be inclined to help with as much to go on as possible.  Thanks in advance for any responses,

 

-will

New user, evaluating XML libraries
user name
2006-12-19 18:03:52
On Tue, Dec 19, 2006 at 12:23:03PM -0500, Will Sappington wrote:
> Hello to all,

  Hi, only answering a few specific points; I may be biased otherwise.

> *	(I may be mistaken about this, but...) for character encodings
> libxml uses a standard library (iconv) that is distributed with most
> versions of Linux and Unix (and has been ported to Win32),

  It's slightly more complex: libxml2 uses its own routines for
UTF-8/UTF-16 and ISO Latin 1, to ensure it always works on the mandatory
encodings, but uses iconv when it is detected at build time, which is
the standard and preferred way on Unix/Linux.

> *	In addition to a DOM-like interface and SAX support, libxml has
> the XMLTextReader interface which I haven't tried yet, but I'm assuming
> is a fast efficient way to do simple XML queries.  Xerces has only DOM
> and SAX.

  XMLTextReader is streaming, more convenient than SAX, but a bit slower.
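
To illustrate the convenience difference, here is a rough sketch of the same extraction written push-style (SAX callbacks) and pull-style (a reader loop). It uses Python's standard-library xml.sax and xml.dom.pulldom as stand-ins, not libxml2's SAX or XMLTextReader APIs, and it says nothing about their relative speed. The document and function names are made up for the example.

```python
import xml.sax
from xml.dom import pulldom

DOC = "<records><rec id='1'>alpha</rec><rec id='2'>beta</rec></records>"

class RecIdHandler(xml.sax.ContentHandler):
    """Push style: the parser calls us, so all state lives in the handler."""
    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "rec":
            self.ids.append(attrs.getValue("id"))

def ids_with_sax(text):
    handler = RecIdHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.ids

def ids_with_pull(text):
    """Pull style: the application drives the loop and asks for each event."""
    ids = []
    for event, node in pulldom.parseString(text):
        if event == pulldom.START_ELEMENT and node.tagName == "rec":
            ids.append(node.getAttribute("id"))
    return ids

print(ids_with_sax(DOC), ids_with_pull(DOC))
```

With the pull loop the control flow stays in ordinary application code, which is usually what people mean by "more convenient than SAX".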

> I've likened the use of big packages like Xerces for some of the things
> we need to "using a blowtorch to light a cigarette".  Here is one
> response from a Xerces user:
>
> "Libxml is a great library with somewhat different goals than Xerces.  I
> don't think it's explicitly stated on the Web site, but Xerces and other
> projects that build on it tend to implement W3C standards (DOM, XML
> Schema), while libxml implements what its maintainer prefers (a unique
> API, RelaxNG), with a focus on efficiency.  Both approaches are
> reasonable, and which is appropriate depends on your needs.

  Let's be frank, it's a bit of FUD. Schemas is being implemented; it's
not fully implemented because the spec is basically broken beyond
recovery. I implement and believe in standards (I sit on the W3C XML Core
Working Group, like IBM representatives), but standardizing APIs at W3C
has been a disaster. DOM is IMHO severely broken, and SAX is not formally
defined except for Java. On the other hand, the XMLTextReader from C# is
part of the ECMA C# spec, and is a good API.

> In your shoes, if I were certain that lighting a cigarette is all I
> would ever need to do, I'd probably use libxml.  In my experience,
> though, XML is useful for so many things that I'd probably want to be
> prepared to bake, boil, weld, and power fighter jets as well - in a
> variety of local languages.  I'm a nut for portability, and a DOM
> interface has the advantage of being similar or identical in a wide
> range of environments (C++, C#, JavaScript, etc)."
>
> What about this?  Is Xerces that much more powerful, as the writer
> suggests?  Is portability the only advantage to W3C-compliant interfaces
> like DOM?

  Simple: DOM is not defined for C. There is no proper binding, and the
result of running the interface generator for C is a heresy no one sane
wants to work against.
  Also, DOM *requires* UTF-16 for all strings. This means that in general
1/ you will lose time, as most content around is UTF-8
2/ you will lose memory space/cache efficiency, as the converted output is
   way larger on average
3/ you will lose CPU efficiency, as breaking the cache is the #1
   performance problem in modern computers
4/ most Unix APIs are fine with UTF-8 content, but working with UTF-16
   is *not* fun there; this is biased toward Windows programming IMHO
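
The size argument in point 2 is easy to check. The sketch below just counts encoded bytes for the same string in both encodings; the sample strings and the helper name are invented for the example, and it measures storage only, not conversion cost.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the same string, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Markup and ASCII content: one byte per character in UTF-8, two in UTF-16.
ascii_markup = '<record id="42"><modality>CT</modality></record>'

# Four CJK characters in the Basic Multilingual Plane:
# three bytes each in UTF-8, two bytes each in UTF-16.
cjk_text = "患者記録"

for sample in (ascii_markup, cjk_text):
    utf8_len, utf16_len = encoded_sizes(sample)
    print(f"{len(sample)} chars: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

ASCII-heavy documents double in size under UTF-16, while text in scripts such as CJK is smaller in UTF-16. Since element names and punctuation are normally ASCII, which effect dominates depends on how much of a document is markup versus non-Latin content.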

> And then this:
>
> "In cases where performance is critical, I think you'd be best off
> avoiding XPath altogether. (snip) An optimal Xerces SAX parser might
> well be more efficient than libxml parsing + XPath evaluation."

  If you can avoid XPath, sure, it is never the most efficient. But it's
certainly easier to write code with it and *maintain* said code. The main
problem is that XPath somehow forces you to use a tree, at least in
libxml2. But see
   http://xmlsoft.org/xmlreader.html#Mixing
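
The general idea of mixing the two models is to stream through the document and build a tree only for the small piece you want to query. A rough sketch of that pattern follows, using Python's standard-library xml.dom.pulldom and its expandNode() call as an analogy; it is not the libxml2 reader API, and the document layout and function name are assumptions.

```python
from xml.dom import pulldom

# Hypothetical document; element names are invented for the example.
DOC = """<archive>
  <record id="r1"><patient>A123</patient><note>first</note></record>
  <record id="r2"><patient>B456</patient><note>second</note></record>
</archive>"""

def notes_for_patient(text, patient_id):
    """Stream over the document, expanding one <record> subtree at a time."""
    notes = []
    events = pulldom.parseString(text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "record":
            # Build a DOM subtree for this record only; the stream resumes
            # after the record's end tag.
            events.expandNode(node)
            patient = node.getElementsByTagName("patient")[0]
            if patient.firstChild.data == patient_id:
                note = node.getElementsByTagName("note")[0]
                notes.append(note.firstChild.data)
    return notes

print(notes_for_patient(DOC, "B456"))
```

Only one record's subtree is alive at a time, so memory use stays bounded by the largest record rather than by the whole file.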

> Finally:
>
> "One big difference between Xerces-C++ and Libxml2 is that the latter
> does not have a functional XML Schema validator. I don't know if it

  There is no functional XSD validator. Go to the xmlschemas-dev archive
at W3C, check the last 5 questions from Michael Kay (who is a Schemas
implementor and one of the W3C spec writers): they have been unanswered
for weeks now, and nobody can tell what it is supposed to do. Trying to
use XSD to promote interoperability or validation of data is kind of a
joke. Relax-NG on the other hand is an ISO standard, has a formal
specification and can be read and understood by most programmers in a
matter of a couple of days. Pick your poison; sorry, I can tell the
difference between a bad technology and a good one, especially in that
domain.

> I'm unsure of the importance of an XML Schema validator so I can't
> comment on this.  I don't think I agree with the comment about speed vis
> a vis UTF-8/16.  Encoding conversions using UTF-8 are more
> computationally intensive than UTF-16 so what you lose by moving around
> double the number bytes would, I think be offset by the greater CPU
> requirement for translating the data.  Does Xerces' use of UTF-16
> provide support for a wider range of encodings and local languages?

  It's DOM and internal APIs which force Xerces to UTF-16; it's just a
very bad design decision which was made by IBM and Microsoft. The XML
specification mandates that any compliant parser can process both UTF-8
and UTF-16 inputs.
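
In other words, the input encoding and the parser's internal string representation are separate questions. As a small illustration, the sketch below feeds the same document to one parser as UTF-8 bytes and as UTF-16 bytes and gets the same text back. It uses Python's standard-library ElementTree (backed by expat), not libxml2 or Xerces, and the sample content is invented.

```python
import xml.etree.ElementTree as ET

def parse_text(raw_bytes):
    """Parse encoded XML bytes and return the root element's text."""
    return ET.fromstring(raw_bytes).text

content = "<note>café 記録</note>"

utf8_bytes = content.encode("utf-8")
# The "utf-16" codec prepends a byte order mark, which is how the parser
# recognises the encoding when there is no XML declaration.
utf16_bytes = content.encode("utf-16")

print(parse_text(utf8_bytes) == parse_text(utf16_bytes))
```

So a library that stores strings as UTF-8 internally does not accept a narrower range of documents or languages than one that stores UTF-16.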

> I know this is rather long and I apologize in advance if it is too much
> so, but obviously there's a lot to be considered, this is a hefty
> decision, and I want to provide anybody who might be inclined to help
> with as much to go on as possible.  Thanks in advance for any responses,

  Here you are. I'm certainly a bit biased, though I tried to be honest.

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillardredhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xmlgnome.org
http://mail.gnome.org/mailman/listinfo/xml
New user, evaluating XML libraries
user name
2006-12-21 20:19:27
Will, FWIW, here's my 2cents.

A few months back, I started a project using both Xerces-C++ and libxml (http://xmlnanny.com).

Initially, I was more attracted to Xerces (probably because it was C++/OO, which I preferred), so I used Xerces wherever I could in my project (basic XML parsing, DTD and XSD validation), and supplemented with libxml for features Xerces didn't have (RNG, pull parsing).

Although Xerces is a wonderful toolkit, after having used both, I'm a much bigger fan of libxml. I've done several xml projects since then, and chosen libxml exclusively for each. I'm currently planning a rewrite of my original project and will be using libxml exclusively and removing Xerces. 

Some things I prefer about libxml that I couldn't get Xerces to do (maybe it was my fault, don't know for sure):

1. libxml allows you to dynamically specify a DTD for validation without having to hard code the Doctype in the source doc. AFAICT, you can't do this with XercesC.
2. dynamically specifying an XSD against which to validate a source document seems MUCH, MUCH easier with libxml.
3. I prefer libxml's error messages... they seem more complete and make more sense to humans (JMHO).
4. libxml has XmlReader (pull-parsing)
5. libxml has RNG support
6. libxml has XPath support
7. libxml has XInclude support (AFAICT, XercesC doesn't)

To be fair, I found the quality of documentation of the two products to be about equal.

I don't understand anyone choosing XercesC because it has DOM support... everyone knows DOM is a pretty crap API, and libxml has its own tree API that can do the exact same things, more or less. Who cares if it's technically 'DOM' or not? I don't get it.

I hope you give libxml a fair shot... screw the DOM ;)


Todd Ditchendorf

Scandalous Software - Cocoa Developer Tools



