> Hi, I want to parse an XML document using SAX, but my big issue is
> the whitespace that gets reported. I want to know how to ignore it
> efficiently. I know there are some DocumentHandlers, and one
> specifically for ignoring whitespace, but I still end up with a bunch
> of invisible nodes like \t or \n.
>
> Does anyone have a tutorial on how to handle SAX for this kind of
> parsing?

In general, the notion of "significant whitespace" is pretty weak in
XML (independent of SAX, so I don't think Stefan's bashing of SAX
was of any help). Here is what I know about it:
- White space should be preserved if the xml:space attribute was
  given on an element and has the value "preserve". Otherwise,
  it's up to the application what precisely to do with white space.
- White space in "element content" is usually considered ignorable,
  and the XML spec requires that a validating parser report it as
  such. However, whether an element has element content depends on
  the DTD, so only a validating parser can know. If you turn on
  validation in SAX, white space in element content will be reported
  through the "ignorableWhitespace" event.
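
To make the two points above concrete, here is a rough sketch using
Python's xml.sax. The class name SpaceAwareHandler and the helper
collect_text are made up for illustration, and the policy of dropping
whitespace-only text outside xml:space="preserve" is just one
possible application choice. It assumes namespace processing is off
(the default), so the attribute shows up under the name "xml:space".
The expat-based parser bundled with Python does not validate, so
don't count on ignorableWhitespace being called there.

```python
import xml.sax


class SpaceAwareHandler(xml.sax.ContentHandler):
    """Collect text chunks, honouring xml:space="preserve" (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.preserve = [False]   # stack: one entry per open element
        self.chunks = []          # text the application decided to keep

    def startElement(self, name, attrs):
        # xml:space is inherited, so default to the parent's setting.
        mode = attrs.get("xml:space")
        if mode is None:
            self.preserve.append(self.preserve[-1])
        else:
            self.preserve.append(mode == "preserve")

    def endElement(self, name):
        self.preserve.pop()

    def characters(self, content):
        # Keep everything under xml:space="preserve"; elsewhere this
        # application chooses to drop whitespace-only chunks.
        if self.preserve[-1] or content.strip():
            self.chunks.append(content)

    def ignorableWhitespace(self, whitespace):
        # Only called by parsers that know the content model, i.e.
        # validating ones. Nothing to keep here.
        pass


def collect_text(document):
    """Parse a bytes document and return the kept text chunks."""
    handler = SpaceAwareHandler()
    xml.sax.parseString(document, handler)
    return handler.chunks
```

Note that the parser may split one text run into several characters
calls, so the per-chunk test here is only an approximation; buffering
the data between tags is more robust.
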
So, it's your own choice, and you should make that choice based on
your knowledge of the actual XML application. Typical options are:
a) Preserve all whitespace.
b) Perform validation, then strip all whitespace in element content.
c) Drop white space that completely spans from one tag to another,
   assuming the element has element content. In SAX, track the
   character data (the characters event) since the last startElement
   or endElement, and then choose to drop the whitespace at the next
   startElement or endElement.
d) In many cases, you have either element content or simple text
   content, so in SAX, you can drop the white space if you see nested
   elements.
e) Strip whitespace, in the sense of Python's string.strip. I.e.,
   at endElement, perform .strip() on the collected data.
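
Options (c) and (e) could look roughly like this with Python's
xml.sax. This is only a sketch: the class names, the parse_events
helper and the simplified event list are all invented for the
example. The StripText variant strips at every tag boundary rather
than only at endElement, which is a slightly broader reading of (e).

```python
import xml.sax


class DropInterTagWhitespace(xml.sax.ContentHandler):
    """Option (c): discard whitespace-only runs between two tags."""

    def __init__(self):
        super().__init__()
        self.buffer = []   # character data since the last tag
        self.events = []   # simplified output: (kind, value) tuples

    def _flush(self):
        text = "".join(self.buffer)
        self.buffer = []
        # A run that is entirely whitespace spans from one tag to the
        # next, so it is dropped; anything else is kept untouched.
        if text.strip():
            self.events.append(("text", text))

    def startElement(self, name, attrs):
        self._flush()
        self.events.append(("start", name))

    def endElement(self, name):
        self._flush()
        self.events.append(("end", name))

    def characters(self, content):
        # The parser may deliver one text run in several pieces,
        # so only buffer here and decide at the next tag.
        self.buffer.append(content)


class StripText(DropInterTagWhitespace):
    """Option (e) variant: also apply str.strip() to the text that is kept."""

    def _flush(self):
        text = "".join(self.buffer).strip()
        self.buffer = []
        if text:
            self.events.append(("text", text))


def parse_events(document, handler_class):
    """Parse a bytes document and return the handler's event list."""
    handler = handler_class()
    xml.sax.parseString(document, handler)
    return handler.events
```

For a document such as b"<r>\n  <n> a b </n>\n</r>", the first
handler keeps the text of n exactly as written and drops the
indentation around it, while the second also trims the leading and
trailing blanks inside n.
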
HTH,
Martin
_______________________________________________
XML-SIG maillist - XML-SIG python.org
http://mail.python.org/mailman/listinfo/xml-sig