Thread: Python documentation - any help welcome!




Python documentation - any help welcome!
United Kingdom
2007-02-06 07:15:28
Hi,

After struggling to get to grips with Libxml2 and Python, I
figured that although I can't contribute much in the way of
code, I can have a crack at getting some useful
documentation up together.

I have put the first part up on my Wiki, if anyone would
care to review for accuracy - or help out where it is a bit
light on examples?

http://mikekneller.com/wiki/index.php?title=Getting_started_with_Libxml2_and_Python_-_part_1

I realise that this is probably a bit n00b for most here, but I would
like to bring together workable examples from the ground up. Most of the
other information I have read assumes a level of knowledge I just didn't
have when I encountered the library for the first time.

For reference, I'll post the text here.

Cheers
Mike

=== Getting started with Libxml2 and Python - Part 1 ===

Overview

Getting to grips with Libxml2 and Python can be a frustrating experience,
particularly as in-depth, accurate Python documentation is hard to find
on the Web.

Many Python developers dislike the Libxml2 bindings, as they are
'un-Pythonic' and much too C-like. This, however, misses the point of
Libxml2: the library is portable, mature, extremely full-featured and
*very* fast.

In the process of writing this tutorial, I hung out in the #xml channel on
irc.gnome.org and subscribed to the xml@gnome.org mailing list - I was
given a lot of help when things weren't obvious! Although there's not a
massive amount of activity on IRC or on the mailing list on a daily basis,
I would definitely recommend spending some time browsing the archive - or
using Google to search it - when you have questions. I have found the
people in the Libxml2 community very helpful.

Manipulating XML using Libxml2 is fairly straightforward once you have a
couple of working examples. That tends to be the problem in Python,
however: finding working examples is a bit of a hit-and-miss affair.

The first place to look is in the examples folder in the documentation
installed with your release
(/usr/share/doc/libxml2-python-2.6.27/examples on my machine).

TODO: where are the examples on a number of distributions/platforms?

Also, take a moment to scan through libxml2.py itself - this is the Python
wrapper, and it is a good place to look if you are hunting for a
particular function. There is plenty of information in the wrapper, as all
the docstrings have been populated, so you can always get information like

	print libxml2.parseFile.__doc__
	
for any particular function.

Also remember that you can list the available methods for any Python
object by using the dir function. The most immediately useful objects are
xmlCore, xmlNode and xmlDoc, so
	dir(libxml2.xmlCore)
is your friend when working out what functions are available to you.

I'm going to assume that you know a bit about XML, at least enough to
recognise an XML document when you see one, and hopefully enough about
Python to know where to find the documentation!

[Installing Libxml2]

TODO: installation examples for a number of distros/platforms.

[Loading a document]

The first thing you want to do with XML will be to load a document of some
sort. As a new Libxml2 user, this is where the confusion starts! It is
worth remembering that, in general, the Python bindings are automatically
generated. There is therefore an equivalent Python function for every C
function, and sometimes this leads to unnecessary, or apparently
duplicated, Python functions.

The library contains a number of different functions we can use to load an
XML document:

	parseDoc, parseFile, parseMemory, readDoc, readFd, readFile, readMemory,
	recoverDoc and recoverFile

All of these functions return an xmlDoc object. Examples for using each of
these follow:

	parseDoc(cur) - load an XML document from memory (a string)

	doc = libxml2.parseDoc("""<?xml version="1.0"?>
	<root>Hello world!</root>""")
	

	parseMemory(buffer, size) - load an XML document from memory
	
	doc = libxml2.parseMemory(xml, len(xml))
	
This function performs exactly the same job as parseDoc from a Python
perspective.
	

	parseFile(filename) - load an XML document from a file
	
	doc = libxml2.parseFile('test.xml')

	
	readDoc(cur, URL, encoding, options) - load an XML document from memory (a string)
	
This version of the function allows you to specify options on a
per-document basis. The parseDoc version uses the parser defaults (in
practice, the parser global settings, which can also be modified using
global functions).
	
	In most cases,
		doc = libxml2.readDoc('<foo/>',None,None,0)
	will be equivalent to
		doc = libxml2.parseDoc('<foo/>')

When using XSL, I have found it better to force entities to be resolved
before running the transform, in which case it is useful to use the
following (note that all four arguments are required):
	
	doc = libxml2.readDoc(xml, None, None, libxml2.XML_PARSE_NOENT)
	

	readFd(fd, URL, encoding, options) - load an XML document from a file descriptor
	
	readFile(filename, encoding, options) - load an XML document from a file,
	allowing the specification of per-document options.

	
	readMemory(buffer, size, URL, encoding, options) - for Python, equivalent to
	using readDoc


	recoverDoc(cur) - this works like parseDoc, except that even broken XML
	will result in a valid XML tree being created.
	
	doc = libxml2.recoverDoc('<foo><broken></foo>')
	
will raise a parser error, but after the error has been handled, doc will
contain:
	<?xml version="1.0"?>
	<foo><broken/></foo>


	recoverFile(filename) - same as recoverDoc, but for files.


In the simplest case, to load a file from disk you can do:

	doc = libxml2.parseFile( 'test.xml' )

[Managing your memory]

Ugh, nasty memory management. Isn't that why we're using
Python, to avoid all that
stuff?

The Python bindings do not clean up the memory Libxml2 uses for a document
automatically, so when you finish working with your xmlDoc object, you
need to remember to call freeDoc.

OK, so what we have now is something like the following:

	doc = libxml2.parseFile( 'test.xml' )
	# Do some stuff with the document here!
	doc.freeDoc()

It doesn't matter which function you use to create your xmlDoc object -
each of them returns the same thing, so just remember to call freeDoc on
it when you are done and all will be well.
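
If you are worried about forgetting the freeDoc call, or about an exception
skipping it, one option is to wrap the document in a small context manager.
This is only a sketch: the name freeing is my own invention and not part of
the Libxml2 bindings, and it assumes nothing more than that the object it is
given has a freeDoc method.

```python
from contextlib import contextmanager


@contextmanager
def freeing(doc):
    """Yield doc, then call doc.freeDoc() even if the body raises.

    freeing is a hypothetical helper, not part of the libxml2 bindings.
    """
    try:
        yield doc
    finally:
        # runs on normal exit and when an exception escapes the with block
        doc.freeDoc()
```

It would be used along these lines:

	from __future__ import with_statement   # only needed on Python 2.5
	with freeing(libxml2.parseFile('test.xml')) as doc:
		root = doc.getRootElement()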

There, that wasn't so hard was it? 


[Working with the document]

Now that we have a working document, and know how to dispose of it when
we're done, it is time to look at a number of common XML operations and
see how we can do those using Libxml2 and Python.

[Elements]

The xmlDoc object has a large number of methods. As well as its own
collection, it inherits from xmlNode, which inherits from xmlCore; this
gives you over 200 available methods to read up on! This is fairly
daunting when you can't find an example that shows you how to perform
simple tasks, but don't worry: in practice we can get by in most
situations with a small fraction of these.

All well-formed XML documents contain a single root element, which
contains all the other elements.

You can get a reference to the root element using getRootElement on the
document object. The root element is an xmlNode object, just like all
other nodes in the document. Working with nodes is fairly straightforward:

	>>> import libxml2
	>>> doc = libxml2.parseDoc( '<foo>Hello world.</foo>' )
	>>> root = doc.getRootElement()
	>>> print root.name
	foo
	>>> print root.content
	Hello world.
	>>> root.setProp('bar', 'an attribute')
	<xmlAttr (bar) object at 0x13c00d0>
	>>> print root.serialize()
	<foo bar="an attribute">Hello world.</foo>
	>>> doc.freeDoc()

The serialize method can be called on a single node or on the document,
and provides a string representation of that node or document.

Navigating through the document is not much more difficult - we can use
the node properties (from the xmlCore ancestor object) to find the child
nodes:

	child = root.children
	# the children property returns the FIRST child of a node
	while child is not None:
		if child.type == "element":
			# do something with the child node
			print child.name
		# move on to the next sibling, whatever the type of this node
		child = child.next

Accessing the attributes of a node is possible in a similar way:

	import libxml2
	doc = libxml2.parseDoc('<foo att1="value 1" att2="value 2"/>')
	root = doc.getRootElement()
	for property in root.properties:
		if property.type=='attribute':
			# do something with the attributes
			print property.name
			print property.content
	doc.freeDoc()

Notice that in both looping through the children and looping through the
properties there is a test for the type of the node. This is because most
documents contain additional whitespace, which shows up as extra nodes
alongside the specific node types we are interested in.
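
If you find yourself writing the child loop over and over, it can be
folded into a generator. The following is a rough sketch rather than
anything the library provides: element_children is a name I have made up,
and it relies only on the children, next and type properties already used
above.

```python
def element_children(node):
    """Yield each direct child of node whose type is "element".

    element_children is a hypothetical helper; it only touches the
    children, next and type properties of the node it is given.
    """
    child = node.children          # first child, or None if there are none
    while child is not None:
        if child.type == "element":
            yield child
        child = child.next         # always advance, whatever the node type
```

With that in place, the loop body shrinks to something like:

	for child in element_children(root):
		print child.name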

[XPath]

Navigating a document in this manner is straightforward, but tedious, and
requires accessing every node in the document until you get to the
specific one you need. More often, you want to retrieve a set of nodes or
a single node matching some specific criteria. This is where XPath comes
in, and Libxml2 has full support for XPath.

XPath queries can be run against the document or a specific element in the
document, but in either case the procedure is the same.

The xmlsoft.org Python page suggests the following:

	doc = libxml2.parseFile("test.xml")
	ctxt = doc.xpathNewContext()
	result = ctxt.xpathEval("//*")
	# do something with the result
	
	doc.freeDoc()
	ctxt.xpathFreeContext()

which involves creating an XPath context, running a query against it and
then freeing the context when finished. If you have a lot of queries to
run, then this is the best way to work, as the context can be re-used for
each query.

In practice, the xmlCore object provides a helper function which wraps
this up for you. For single queries, running xpathEval directly on the
node will suffice; just be aware that each query creates and destroys its
own context, which is going to be slower than the above implementation.
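
For the many-queries case, the create/evaluate/free pattern can be wrapped
up so the context is always released. This is a sketch under my own
assumptions: eval_all is a made-up name, and it uses only the
xpathNewContext, xpathEval and xpathFreeContext calls shown in the
xmlsoft.org example. The results should only be used while the document
itself has not yet been freed.

```python
def eval_all(doc, queries):
    """Run every XPath expression in queries against doc using one context.

    Returns a dict mapping each expression to its result. eval_all is a
    hypothetical helper, not part of the libxml2 bindings.
    """
    ctxt = doc.xpathNewContext()
    try:
        # one context is shared by all of the queries
        return dict((query, ctxt.xpathEval(query)) for query in queries)
    finally:
        # freed even if one of the queries fails
        ctxt.xpathFreeContext()
```

For example:

	results = eval_all(doc, ['//foo', '//bar'])
	foos = results['//foo']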

An XPath query that selects nodes will return a list of nodes. This makes
it easy to perform an operation on many nodes at once.

	import libxml2
	doc = libxml2.parseFile('test.xml')
	# select every element in the document
	result = doc.xpathEval('//*')
	for node in result:
		print node.name
	doc.freeDoc()

Apart from the call to freeDoc, I can't see how it could be much more
Pythonic!

[Writing to a file]

To write the contents of your XML document to a file, just use the saveTo
method:

	f = open('output.xml','w')
	doc.saveTo(f)
	f.close()

The saveTo method is also part of xmlCore, so you can use it to save the
contents of just a single node and its children as well as the whole
document.
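
To make sure the file is closed even if saving fails part-way, the three
steps can be put in a small function. Again, this is just a sketch:
save_doc is a name I have made up, and the only thing it assumes about its
first argument is the saveTo method described above (so it should work for
a single node as well as for a document).

```python
def save_doc(doc, path):
    """Write doc to the file at path, closing the file whatever happens.

    save_doc is a hypothetical helper; doc can be anything with a
    saveTo(file) method, such as an xmlDoc or a single node.
    """
    f = open(path, 'w')
    try:
        doc.saveTo(f)
    finally:
        f.close()
```

Usage is then a one-liner:

	save_doc(doc, 'output.xml')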

[Modifying documents]

To add a new node to a document, first we must create the node and then
add it as a child of the element it belongs to.

	import libxml2
	doc = libxml2.parseDoc('<foo/>')
	root = doc.getRootElement()
	newNode = libxml2.newNode('bar')
	root.addChild(newNode)

At this stage, our document contains

	<?xml version="1.0"?>
	<foo><bar/></foo>

We can set the text content of newNode with its setContent method:
	
	newNode.setContent('Hello')

We can append some content to our <bar/> element by calling addContent,

	newNode.addContent(' world')
	
which gives us

	<?xml version="1.0"?>
	<foo><bar>Hello world</bar></foo>

Creating or setting an attribute is easy too: we use the setProp method.

	newNode.setProp('attribute', 'the value')

If the attribute doesn't exist, it will be created; otherwise it will just
have its content changed.

Adding nodes at a particular location in the hierarchy is possible using
addNextSibling or addPrevSibling. These work in the same way as addChild,
except that you call them on the node you wish to add next to, rather than
on the parent.

	sibling = libxml2.newNode('bar2') 
	newNode.addPrevSibling(sibling)

gives

	<?xml version="1.0"?>
	<foo><bar2/><bar attribute="the value">Hello world</bar></foo>

whereas

	sibling = libxml2.newNode('bar2') 
	newNode.addNextSibling(sibling)

gives

	<?xml version="1.0"?>
	<foo><bar attribute="the value">Hello world</bar><bar2/></foo>

To insert text into the document, you create a text node with some content
and add it in the same way:

	text = libxml2.newText('some text\n')
	newNode.addNextSibling(text)
	
which leaves us with

	<?xml version="1.0"?>
	<foo><bar2/><bar attribute="the value">Hello world</bar>some text
	</foo>

To create content and nodes, the useful Libxml2 helper functions are
newComment, newText and newNode. You can also create a new node by copying
one that already exists. The xmlNode object has copyNode and copyProp
methods which can be useful here.

To add these new nodes into a document, you need to use one of the
following methods (directly on nodes rather than on the document):
addChild, addContent, addNextSibling, addPrevSibling.

[XSLT]

Libxml2 has a companion library called libxslt which provides support for
XSL Transformations. I find the following example provides most of the
useful information for a Python coder:

	import libxml2
	import libxslt

	def runTransform(xmlFile, xslFile):
		sourcedoc = libxml2.parseFile(xmlFile)
		styledoc = libxml2.parseFile(xslFile)
		style = libxslt.parseStylesheetDoc(styledoc)
		result = style.applyStylesheet(sourcedoc, None)
		out = style.saveResultToString(result)
		# freeing the stylesheet also frees styledoc
		style.freeStylesheet()
		result.freeDoc()
		sourcedoc.freeDoc()
		return out

Notice that there are three documents involved, each of which needs to be
explicitly freed: the source, the stylesheet and the result. The starting
point for documentation can be found here:
http://xmlsoft.org/XSLT/python.html.

[Libxml2 and HTML]

If you have spent any time poking around libxml2.py, you will probably
have noticed a number of functions that start with html. This is because
Libxml2 has an HTML parser built in that does a pretty good job of loading
real world (in other words horribly broken) HTML documents. You can then
use the features we have previously discussed to read or modify the HTML.

The following example will load pretty much any HTML file into an xmlDoc
object:

	parse_options = libxml2.HTML_PARSE_RECOVER + \
		libxml2.HTML_PARSE_NOERROR + \
		libxml2.HTML_PARSE_NOWARNING
	doc = libxml2.htmlReadDoc(html, '', None, parse_options)

Here is a more complete example, which extracts all the links from the
Guardian newspaper website home page and prints the href attribute.

	import urllib2
	import libxml2

	# Load the page into a string
	f = urllib2.urlopen('http://www.guardian.co.uk')
	html = f.read()
	f.close()

	parse_options = libxml2.HTML_PARSE_RECOVER + \
		libxml2.HTML_PARSE_NOERROR + \
		libxml2.HTML_PARSE_NOWARNING
	doc = libxml2.htmlReadDoc(html,'',None,parse_options)
	links = doc.xpathEval('//a')
	for link in links:
		href = link.xpathEval('attribute::href')
		if len(href) > 0:
			href = href[0].content	
			print href
	doc.freeDoc()
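
A possible shortcut for the link example is to let XPath pick out the href
attributes directly, instead of running a second query on every link. This
is a sketch rather than a tested recipe: link_targets is a name I have
made up, and it assumes only that xpathEval hands back attribute nodes
with a content property, as in the example above.

```python
def link_targets(doc):
    """Return the href value of every <a> element that has one.

    link_targets is a hypothetical helper. The query //a/@href selects
    the href attribute nodes themselves, so links without an href are
    skipped without any extra test.
    """
    return [attr.content for attr in doc.xpathEval('//a/@href')]
```

It would be called before the document is freed:

	for href in link_targets(doc):
		print href
	doc.freeDoc()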


_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml

Re: Python documentation - any help welcome!
2007-02-06 08:07:00
Mike Kneller <ukchillmac.com> writes:

> Hi,
>
> After struggling to get to grips with Libxml2 and Python, I figured
> that although I can't contribute much in the way of code, I can have
> a crack at getting some useful documentation up together.
>
> I have put the first part up on my Wiki, if anyone would care to
> review for accuracy - or help out where it is a bit light on
> examples?
>
> http://mikekneller.com/wiki/index.php?title=Getting_started_with_Libxml2_and_Python_-_part_1
>
> I realise that this is probably a bit n00b for most here, but I
> would like to bring together workable examples from the ground up,
> most of the other information I have read assumes a level of
> knowledge I just didn't have when I encountered the library for the
> first time.

Hey! I didn't see this till just now. I'm doing a *lot* with
libxml2/libxslt and python. I'll take a look at your doc and let you
know what I think.

-- 
Nic Ferrier
http://www.tapsellferrier.co.uk   for all your tapsell ferrier needs

Re: Python documentation - any help welcome!
2007-02-06 10:48:29
Hi Mike:

I've used the libxml2 python bindings a fair bit and this is a good
start on documenting them. There is a bit of a learning curve but I
think that has more to do with learning libxml2 and less with the python
bindings, but that said it's still nice to see python specific
documentation.

If I had any suggestions it would be to intersperse working python code
examples for common operations in with the explanatory prose. I think a
lot of folks just quickly want to know how to do basic tasks, a sort of
cookbook FAQ. e.g.

how do I parse a doc and find all foobar elements and return a list of
them?

how do I build complex python objects by parsing an XML doc?

how can I serialize python objects into XML?

etc.

The examples can illustrate basic concepts in libxml2.
-- 
John Dennis <jdennisredhat.com>


