Thread: How to parse an XML in SAX




How to parse an XML in SAX
Mexico
2007-11-11 03:08:28
Hi, I want to parse an XML document using SAX, but my big issue is the
whitespace that gets reported. I want to know how to ignore it
efficiently. I know there are some DocumentHandlers, and one specifically
for ignoring whitespace, but I still end up with a bunch of invisible
nodes like \t or \n.

Does anyone have a tutorial on how to handle SAX for this kind of
parsing?

-- 
Alexandro Colorado
Help the Tabasco Relief efforts:
http://rootcoffee.blogspot.com/2007/11/race-to-save-mexico-flood-victims.html
_______________________________________________
XML-SIG maillist  -  XML-SIG@python.org
http://mail.python.org/mailman/listinfo/xml-sig

Re: How to parse an XML in SAX
Germany
2007-11-11 09:16:13
Alexandro Colorado wrote:
> Hi, I want to parse an XML document using SAX

Any reason why you would want to do that?


> but my big issue is the whitespace that gets reported. I want to know
> how to ignore it efficiently. I know there are some DocumentHandlers,
> and one specifically for ignoring whitespace, but I still end up with
> a bunch of invisible nodes like \t or \n.
>
> Does anyone have a tutorial on how to handle SAX for this kind of
> parsing?

Consider using cElementTree's iterparse() instead.

http://effbot.org/zone/element-iterparse.htm

It's also available in lxml.etree.
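
A minimal sketch of what the iterparse() approach could look like. It
assumes a Python where ElementTree ships in the standard library as
xml.etree.ElementTree (2.5 and later); the helper name iter_text and the
sample document are made up for illustration. Whitespace between tags
only ever shows up in .text and .tail, so it never becomes a separate
node, and stripping it is one call.

```python
import io
import xml.etree.ElementTree as ET


def iter_text(source, tag):
    """Yield the stripped text of every <tag> element found in source.

    source may be a file name or a binary file object.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # elem.text is None for an empty element, hence the "or".
            yield (elem.text or "").strip()
            # Release the element's content to keep memory use low.
            elem.clear()


# Hypothetical sample document with tabs and newlines between tags.
sample = io.BytesIO(
    b"<gallery>\n\t<title>  First  </title>\n\t<title>Second</title>\n</gallery>"
)
titles = list(iter_text(sample, "title"))
print(titles)
```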

Stefan

Re: How to parse an XML in SAX
Germany
2007-11-12 05:06:56
[going back to the list]

Alexandro Colorado wrote:
> On Sun, 11 Nov 2007 21:38:21 -0600, Stefan Behnel <stefan_mlbehnel.de>
> wrote:
>
>> The tool I actually mentioned, cElementTree, should also work just
>> fine on 2.3. Note also that ElementTree (without the 'c') is pure
>> Python, so it doesn't require you to compile anything.
>
> Thanks for selling me on ElementTree; however, I can't use it because
> the version of the Python distribution that is being shipped doesn't
> have ElementTree, so in this particular situation I can only use the
> standard libraries.

I'm not sure I understand this. You are writing Python code, right? Why
can't you just add another Python source file (such as ElementTree.py)?

Stefan


> Now going back to SAX, is there a way I can escape the non-printable
> characters, and how exactly do they get in there in the first place?
> SAX is a very quick parser from what I've read. I have found this
> tutorial on Python and SAX:
>
> http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/
>
> I have moved on to reading other tutorials to see if they address this
> issue. I am interested in this parsing code specifically, to see a way
> of escaping or 'passing' the printout of special characters:
>
>     def endElement(self, name):
>         if name == "img":
>             print "%8s %s" % (self.name, self.title)
>             self.name = self.title = ""  # just for safety
>         if name == "title":
>             pass
>
> Not sure what %8s and %s do compared to escaping the \t or \n.
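
On the %8s question: %8s and %s only control how a string is laid out
when printed (%8s pads it on the left to a width of eight); they do not
escape or remove tabs and newlines. A small sketch in current Python
syntax, with made-up values standing in for self.name and self.title,
shows the difference between formatting, stripping, and inspecting the
hidden characters with repr():

```python
# Hypothetical values standing in for self.name and self.title.
name = "a.png"
title = "\n\tA cat\n"

# repr() makes the otherwise invisible characters visible.
print(repr(title))

# strip() removes leading and trailing whitespace before formatting;
# "%8s" then right-justifies the name in a field eight characters wide.
line = "%8s %s" % (name, title.strip())
print(line)
```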



Re: How to parse an XML in SAX
2007-12-03 17:39:29
> Hi, I want to parse an XML document using SAX, but my big issue is the
> whitespace that gets reported. I want to know how to ignore it
> efficiently. I know there are some DocumentHandlers, and one
> specifically for ignoring whitespace, but I still end up with a bunch
> of invisible nodes like \t or \n.
>
> Does anyone have a tutorial on how to handle SAX for this kind of
> parsing?

In general, the notion of "significant whitespace" is pretty weak in
XML (independent of SAX, so I don't think Stefan's bashing of SAX
was of any help). Here is what I know about it:
- White space should be preserved if the attribute xml:space was
  given on an element and has the value "preserve". Otherwise,
  it's up to the application what precisely to do with white
  space.
- White space in "element content" is usually considered ignorable,
  and the XML spec requires that it is reported as such. However,
  whether an element has element content depends on the DTD, so only
  a validating parser can know. If you turn on validation in SAX,
  white space in element content will be reported through the
  "ignorableWhitespace" event.

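The xml:space rule above can be sketched with the standard xml.sax
module. This is only an illustration under stated assumptions: the class
name SpaceAwareHandler and the sample document are made up, namespace
processing is left at its default (off) so the attribute arrives under
the plain name "xml:space", and stripping everywhere else is this
sketch's own policy, since the spec leaves that to the application. As
far as I know, the expat-based reader bundled with Python does not
validate, so the ignorableWhitespace event is not something to rely on
there.

```python
import xml.sax


class SpaceAwareHandler(xml.sax.ContentHandler):
    """Keep text verbatim under xml:space="preserve", strip it elsewhere."""

    def __init__(self):
        super().__init__()
        self._preserve = [False]  # stack: one entry per open element
        self._chunks = []         # character data since the last tag
        self.texts = []           # collected text runs

    def _flush(self):
        text = "".join(self._chunks)
        self._chunks = []
        if not self._preserve[-1]:
            text = text.strip()
        if text:
            self.texts.append(text)

    def startElement(self, name, attrs):
        self._flush()  # pending text belongs to the parent element
        value = attrs.get("xml:space")
        if value == "preserve":
            self._preserve.append(True)
        elif value == "default":
            self._preserve.append(False)
        else:
            # No attribute: inherit the setting of the enclosing element.
            self._preserve.append(self._preserve[-1])

    def characters(self, content):
        # May be called several times for one run of text, so buffer it.
        self._chunks.append(content)

    def endElement(self, name):
        self._flush()
        self._preserve.pop()


# Hypothetical sample document.
doc = (b'<doc>\n  <p>  trimmed  </p>\n'
       b'  <pre xml:space="preserve">  kept  </pre>\n</doc>')
handler = SpaceAwareHandler()
xml.sax.parseString(doc, handler)
print(handler.texts)
```
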
So, it's your own choice, and you should make that choice based on
your knowledge of the actual XML application. Typical options are:
a) Preserve all whitespace.
b) Perform validation, then strip all whitespace in element content.
c) Drop white space that completely spans from one tag to another,
   assuming the element has element content. In SAX, track the character
   data (the characters event) received since the last startElement or
   endElement, and then choose to drop the whitespace at the next
   startElement or endElement.
d) In many cases, you have either element content or simple text
   content, so in SAX, you can drop the white space if you see nested
   elements.
e) Strip whitespace, in the sense of Python's string.strip, i.e.
   at endElement, perform .strip() on the collected data.
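
Options (d) and (e) could be combined roughly as follows, using only the
standard xml.sax module. This is a sketch, not a definitive recipe: the
class name StrippingHandler and the sample document are invented here,
and it assumes every element holds either child elements or text, never
both, so text seen before a nested start tag is simply discarded.

```python
import xml.sax


class StrippingHandler(xml.sax.ContentHandler):
    """Collect (element name, stripped text) pairs for text-only elements."""

    def __init__(self):
        super().__init__()
        self._chunks = []  # character data since the last start/end tag
        self.items = []    # results as (element name, text)

    def startElement(self, name, attrs):
        # Option (d): a nested element follows, so whatever was collected
        # so far is treated as ignorable and dropped.
        self._chunks = []

    def characters(self, content):
        # The parser may deliver one text run in several pieces.
        self._chunks.append(content)

    def endElement(self, name):
        # Option (e): strip the collected data at endElement.
        text = "".join(self._chunks).strip()
        self._chunks = []
        if text:
            self.items.append((name, text))


# Hypothetical sample document with tabs and newlines between tags.
doc = (b"<gallery>\n\t<img>\n\t\t<name>  cat.png </name>\n"
       b"\t\t<title>A cat</title>\n\t</img>\n</gallery>")
handler = StrippingHandler()
xml.sax.parseString(doc, handler)
print(handler.items)
```

Buffering in characters() and joining at endElement matters because a
single run of text is not guaranteed to arrive in one call.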

HTH,
Martin
