Hi,
After struggling to get to grips with Libxml2 and Python, I
figured that although I can't contribute much in the way of
code, I can have a crack at getting some useful
documentation up together.
I have put the first part up on my Wiki, if anyone would
care to review for accuracy - or help out where it is a bit
light on examples?
http://mikekneller.com/wiki/index.php?title=Getting_started_with_Libxml2_and_Python_-_part_1
I realise that this is probably a bit n00b for most here,
but I would like to bring together workable examples from
the ground up. Most of the other information I have read
assumes a level of knowledge I just didn't have when I
encountered the library for the first time.
For reference, I'll post the text here.
Cheers
Mike
=== Getting started with Libxml2 and Python - Part 1 ===
Overview
Getting to grips with Libxml2 and Python can be a
frustrating experience,
particularly as in-depth, accurate Python documentation is
hard to find
on the Web.
Many Python developers dislike the Libxml2 bindings, as they
are 'un-Pythonic'
and much too C-like. This however misses the point of
Libxml2. The point is that
this library is portable, mature, extremely full-featured
and *very* fast.
In the process of writing this tutorial, I hung out in the #xml channel on
irc.gnome.org, and subscribed to the xml@gnome.org mailing list - I
was given a lot of help when things weren't obvious!
Although there's not a massive
amount of activity on IRC, or in the mailing list on a daily
basis, I would
definitely recommend spending some time browsing the archive
- or using Google
to search it when you have questions. Additionally, I have
found the people in
the Libxml2 community very helpful.
Manipulating XML using Libxml2 is fairly straightforward once you have a couple
of working examples. The problem in Python is that finding working examples
tends to be a bit of a hit-and-miss affair.
The first place to look is in the examples folder in the
documentation installed
with your release
(/usr/share/doc/libxml2-python-2.6.27/examples on my
machine).
TODO: where are the examples on a number of
distributions/platforms?
Also, take a moment to scan through libxml2.py itself - this is the Python wrapper and
is a good place to look if you are hunting for a particular function. There
is plenty of information in the wrapper, as all the docstrings have been
populated. You can get the documentation for any particular function with something like
print libxml2.parseFile.__doc__
Also remember that you can list the available methods for any Python object by
using the dir function. The most immediately useful objects are xmlCore, xmlNode
and xmlDoc, so
dir(libxml2.xmlCore)
is your friend when working out what functions are available to you.
I'm going to assume that you know a bit about XML, at least
enough to recognise
an XML document when you see one, and hopefully enough about
Python to know
where to find the documentation!
[installing Libxml2]
TODO: installation examples for a number of
distros/platforms.
[Loading a document]
The first thing you want to do in XML will be to load a
document of some sort.
As a new Libxml2 user, this is where our confusion starts!
It is worth remembering
that in general, the Python bindings are automatically
generated - therefore
there is an equivalent Python function for every C function,
and sometimes this
can lead to unnecessary, or apparently duplicated Python
functions.
The library contains a number of different functions we can
use to load an XML
document:
parseDoc, parseFile, parseMemory, readDoc, readFd,
readFile, readMemory,
recoverDoc and recoverFile
All of these functions return an xmlDoc object. Examples for
using each of these
follow:
parseDoc(cur) - load an XML document from memory (a string)
doc = libxml2.parseDoc("""<?xml version="1.0"?>
<root>Hello world!</root>""")
parseMemory(buffer, size) - load an XML document from memory
doc = libxml2.parseMemory(xml, len(xml))
This function performs exactly the same job as parseDoc from
a Python perspective.
parseFile(filename) - load an XML document from a file
doc = libxml2.parseFile('test.xml')
readDoc(cur, URL, encoding, options) - load an XML document from memory (a string)
This version of the function allows you to specify options on a per-document
basis. The parseDoc version uses the parser defaults (in practice, the
parser global settings, which can also be modified using global functions).
In most cases,
doc = libxml2.readDoc('<foo/>', None, None, 0)
will be equivalent to
doc = libxml2.parseDoc('<foo/>')
When using XSL, I have found it better to force entities
to be resolved before running the transform, in which case it is useful to
use the following:
doc = libxml2.readDoc(xml, None, None, libxml2.XML_PARSE_NOENT)
readFd(fd, URL, encoding, options) - load an XML document from a file descriptor
readFile(filename, encoding, options) - load an XML document from a file, allowing
the specification of per-document options.
readMemory(buffer, size, URL, encoding, options) - for Python, equivalent to
using readDoc
recoverDoc(cur) - this is equivalent to parseDoc, except that even broken XML
will result in a valid XML tree being created.
doc = libxml2.recoverDoc('<foo><broken></foo>')
will raise a parser error, but after the error has been handled, doc will
contain:
<?xml version="1.0"?>
<foo><broken/></foo>
recoverFile(filename) - same as recoverDoc, but for files.
In the simplest case, to load a file from disk you can do:
doc = libxml2.parseFile( 'test.xml' )
[Managing your memory]
Ugh, nasty memory management. Isn't that why we're using
Python, to avoid all that
stuff?
Libxml2 does not explicitly handle the cleaning up of the
memory it uses, so when
you finish working with your xmlDoc object, you need to
remember to call freeDoc.
OK, so what we have now is something like the following:
doc = libxml2.parseFile( 'test.xml' )
# Do some stuff with the document here!
doc.freeDoc()
It doesn't matter which function you use to create your xmlDoc object - each of
them returns the same thing, so just remember to call freeDoc on it when you
are done and all will be well.
There, that wasn't so hard was it?
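If remembering freeDoc on every code path sounds error-prone, one option is to wrap it in a context manager. This is only a sketch: freeing is a name I have made up for illustration, not part of Libxml2, and it assumes nothing more than an object with a freeDoc method.

```python
from contextlib import contextmanager


@contextmanager
def freeing(doc):
    """Hand doc to the with block, then call doc.freeDoc() afterwards,
    even if the block raises an exception."""
    try:
        yield doc
    finally:
        doc.freeDoc()

# Usage, assuming libxml2 is installed:
#   with freeing(libxml2.parseFile('test.xml')) as doc:
#       root = doc.getRootElement()
```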
[Working with the document]
Now that we have a working document, and know how to dispose of it when we're done,
it is time to look at a number of common XML operations and see how we can do
those using Libxml2 and Python.
[Elements]
The xmlDoc object has a large number of methods. As well as its own collection,
it inherits from xmlNode, which inherits from xmlCore; this gives you over 200
available methods to read up on! This is fairly daunting when you can't find an
example that shows you how to perform simple tasks, but don't worry. In practice
we can get by in most situations with a small fraction of these.
All well-formed XML documents contain a single root node, which contains all the
other nodes.
You can get a reference to the root element using getRootElement on the document
object. The root element is an xmlNode object, just like all other nodes in the
document. Working with nodes is fairly straightforward:
>>> import libxml2
>>> doc = libxml2.parseDoc( '<foo>Hello world.</foo>' )
>>> root = doc.getRootElement()
>>> print root.name
foo
>>> print root.content
Hello world.
>>> root.setProp('bar', 'an attribute')
<xmlAttr (bar) object at 0x13c00d0>
>>> print root.serialize()
<foo bar="an attribute">Hello world.</foo>
>>> doc.freeDoc()
The serialize method can be called on a single node or on the document, and
returns a string representation of whichever one it was called on.
Navigating through the document is not much more difficult -
we can use the node
properties (from the xmlCore ancestor object) to find the
child nodes:
child = root.children
# the children property returns the FIRST child of a node
while child is not None:
    if child.type == "element":
        # do something with the child node
        print child.name
    child = child.next
Accessing the attributes of a node is possible in a similar way:
import libxml2
doc = libxml2.parseDoc('<foo att1="value 1" att2="value 2"/>')
root = doc.getRootElement()
for property in root.properties:
    if property.type=='attribute':
        # do something with the attributes
        print property.name
        print property.content
doc.freeDoc()
Notice that in both looping through the children, and looping through the
properties, there is a test for the type of the node. This is because in most
documents there is additional whitespace, which shows up as text nodes alongside
the specific node types we are interested in.
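The two loops above can be folded into small helpers so that the type test lives in one place. What follows is only a rough sketch: child_elements and attribute_dict are names I have made up for illustration, not part of Libxml2, and they rely on nothing more than the children, properties, next, type, name and content properties already used above.

```python
def child_elements(node):
    """Yield the direct children of node whose type is 'element'."""
    child = node.children            # first child, or None
    while child is not None:
        if child.type == "element":  # skip text, comments and so on
            yield child
        child = child.next           # next sibling, or None


def attribute_dict(node):
    """Return the attributes of node as a {name: content} dictionary."""
    attrs = {}
    prop = node.properties           # first attribute, or None
    while prop is not None:
        if prop.type == "attribute":
            attrs[prop.name] = prop.content
        prop = prop.next             # next attribute, or None
    return attrs
```

Both helpers cope with a node that has no children or no attributes, since they stop as soon as they see None.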
[XPath]
Navigating a document in this manner is straightforward, but tedious, and requires
accessing every node in the document until you get to the specific one you need.
More often, you want to retrieve a set of nodes or a single node matching some
specific criteria. This is where XPath comes in, and Libxml2 has full support
for XPath.
XPath queries can be run against the document or a specific
element in the
document, but in either case the procedure is the same.
The xmlsoft.org Python page suggests the following:
doc = libxml2.parseFile("test.xml")
ctxt = doc.xpathNewContext()
result = ctxt.xpathEval("//*")
# do something with the result
doc.freeDoc()
ctxt.xpathFreeContext()
which involves creating an XPath context, running a query
against it and then
freeing the context when finished. If you have a lot of
queries to run, then
this is the best way to work, as the context can be re-used
for each query.
In practice, the xmlCore object provides a helper function which wraps this up
for you. For single queries, running xpathEval directly on the node will suffice.
Just be aware that each query creates and destroys its own context, which is
going to be slower than the above implementation.
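For the case where you do have a lot of queries, the create/evaluate/free sequence can be wrapped so that the context is always released. This is a sketch under my own assumptions: run_queries is a hypothetical helper name, and it uses only the xpathNewContext, xpathEval and xpathFreeContext calls shown above.

```python
def run_queries(doc, queries):
    """Evaluate several XPath expressions against doc with one shared context.

    Returns a dictionary mapping each expression to its result.
    """
    ctxt = doc.xpathNewContext()
    try:
        results = {}
        for query in queries:
            results[query] = ctxt.xpathEval(query)
        return results
    finally:
        # release the context even if one of the queries fails
        ctxt.xpathFreeContext()
```

The nodes in the results still belong to the document, so use them before calling freeDoc on it.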
An XPath query that selects nodes will return a Python list of nodes. This makes it
easy to perform an operation on many nodes at once.
import libxml2
doc = libxml2.parseFile('test.xml')
# select every element in the document
result = doc.xpathEval('//*')
for node in result:
    print node.name
doc.freeDoc()
Apart from the call to freeDoc, I can't see how much more
Pythonic it could be?
[Writing to a file]
To write the contents of your XML document to a file, just use the saveTo method:
f = open('output.xml','w')
doc.saveTo(f)
f.close()
The saveTo method is also part of xmlCore, so you can use it to save the contents
of just a single node and its children as well as the whole document.
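An alternative that makes it hard to forget closing the file is to write out the string from serialize inside a with block. This is a minimal sketch rather than the library's own way of doing it: write_serialized is an illustrative name of my own, and the node can be anything with the serialize method shown earlier (a single node or the whole document).

```python
def write_serialized(node, path):
    """Write the string form of node (or a whole document) to path."""
    # the with block closes the file even if serialize() or write() fails
    with open(path, 'w') as f:
        f.write(node.serialize())
```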
[Modifying documents]
To add a new node to a document, first we must create the
node and then add it
as a child of the element it belongs to.
import libxml2
doc = libxml2.parseDoc('<foo/>')
root = doc.getRootElement()
newNode = libxml2.newNode('bar')
root.addChild(newNode)
At this stage, our document contains
<?xml version="1.0"?>
<foo><bar/></foo>
Using the setContent method of newNode, we can give the element some text:
newNode.setContent('Hello')
We can append some content to our <bar/> element by
calling addContent,
newNode.addContent(' world')
which gives us
<?xml version="1.0"?>
<foo><bar>Hello world</bar></foo>
Creating or setting an attribute is easy too; we use the setProp method.
newNode.setProp('attribute', 'the value')
If the attribute doesn't exist, it will be created; otherwise it will just have
its content changed.
Adding nodes at a particular location in the hierarchy is
possible using
addNextSibling, or addPrevSibling. These operate in the same
way as addChild,
except they operate on the node you wish to add next to,
rather than to the
parent.
sibling = libxml2.newNode('bar2')
newNode.addPrevSibling(sibling)
gives
<?xml version="1.0"?>
<foo><bar2/><bar attribute="the value">Hello world</bar></foo>
whereas
sibling = libxml2.newNode('bar2')
newNode.addNextSibling(sibling)
gives
<?xml version="1.0"?>
<foo><bar attribute="the value">Hello world</bar><bar2/></foo>
To insert text into the document, you create a text node with some content and
add it in the same way:
text = libxml2.newText('some text\n')
newNode.addNextSibling(text)
which leaves us with
<?xml version="1.0"?>
<foo><bar2/><bar attribute="the value">Hello world</bar>some text
</foo>
To create content and nodes, the useful Libxml2 helper
functions are newComment,
newText and newNode. You can also create a new node by
copying one that already
exists. The xmlNode object has copyNode and copyProp methods
which can be useful
here.
To add these new nodes into a document, you need to use one
of the following
methods (directly on nodes rather than on the document),
addChild, addContent,
addNextSibling, addPrevSibling.
[XSLT]
Libxml2 has a companion library called libxslt which
provides support for
XSL Transformations. I find the following example provides
most of the
useful information for a Python coder:
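import libxml2
import libxslt
def runTransform(xmlFile, xslFile):
    # parse the source document and the stylesheet
    sourcedoc = libxml2.parseFile(xmlFile)
    styledoc = libxml2.parseFile(xslFile)
    style = libxslt.parseStylesheetDoc(styledoc)
    # apply the stylesheet (no parameters) and serialise the result
    result = style.applyStylesheet(sourcedoc, None)
    out = style.saveResultToString(result)
    # free the stylesheet, the result and the source
    style.freeStylesheet()
    result.freeDoc()
    sourcedoc.freeDoc()
    return out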
Notice that there are three documents involved, each of which needs to be
explicitly freed: the source, the stylesheet and the result. The starting point
for documentation can be found here: http://xmlsoft.org/XSLT/python.html.
[Libxml2 and HTML]
If you have spent any time poking around libxml2.py, you
will probably have
noticed a number of functions that start with html. This is
because Libxml2 has
an HTML parser built in that does a pretty good job of
loading real world
(in other words horribly broken) HTML documents. You can
then use the features
we have previously discussed to read or modify the HTML.
The following example will load pretty much any HTML file into an xmlDoc object:
parse_options = (libxml2.HTML_PARSE_RECOVER +
                 libxml2.HTML_PARSE_NOERROR +
                 libxml2.HTML_PARSE_NOWARNING)
doc = libxml2.htmlReadDoc(html, '', None, parse_options)
Here is a more complete example, which extracts all the
links from the Guardian
newspaper Website home page and prints the href attribute.
import urllib2
import libxml2
# Load the page into a string
f = urllib2.urlopen('http://www.guardian.co.uk')
html = f.read()
f.close()
parse_options = (libxml2.HTML_PARSE_RECOVER +
                 libxml2.HTML_PARSE_NOERROR +
                 libxml2.HTML_PARSE_NOWARNING)
doc = libxml2.htmlReadDoc(html, '', None, parse_options)
# find every anchor element, then pull out its href attribute
links = doc.xpathEval('//a')
for link in links:
    href = link.xpathEval('attribute::href')
    if len(href) > 0:
        href = href[0].content
        print href
doc.freeDoc()
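The nested query inside the loop can also be written as a single XPath expression that selects the attribute nodes directly. A small sketch, assuming a document parsed as above; link_hrefs is a hypothetical helper name of my own, not a Libxml2 function.

```python
def link_hrefs(doc):
    """Return the href value of every <a> element that has one."""
    # '//a/@href' selects the href attribute nodes themselves, so links
    # without an href simply do not appear in the result
    return [attr.content for attr in doc.xpathEval('//a/@href')]
```

As with the loop version, call this before freeDoc, since the attribute nodes belong to the document.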
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml