Thread: lookahead pruning with event-based parsers

lookahead pruning with event-based parsers
user name
2006-04-04 18:44:07
Perl-XML Experts,

I'm parsing large XML files, so I have to use an event based parser.
Sticking with SAX parsers for now.  I have found some really good
articles on using SAX parsers.  They helped me write a filter that
prunes one tag, and all its related data and sub-tags, from a large
XML file.  Great!  Now I need to do something more sophisticated.

Here is a simple XML fragment ...

   <ofd>
      <dwarf>
         <section>
             ... many tags later ...
             <name>fred</name>
             ...
         </section>
         <section>
             ...
             <name>wilma</name>
             ...
         </section>
         <section>
             ...
             <name>pebbles</name>
             ...
         </section>
         ...
     </dwarf>
   </ofd>

I want to keep the section named "wilma" and prune the rest of them.
These sections are very large, thus somehow buffering in memory is not
an option.

The decision to keep or prune a section cannot be made until the name
is seen.  But by the time the name is seen, many parts of the section
have already been processed.  It seems I need to somehow lookahead
and know the name of section just as the <section> tag is being
processed.
I can't figure out how to do that.

I'd appreciate any pointers to articles, tutorials, whatever, that do
something similar with SAX parsers.

One limitation is that I cannot change the XML (like add a name
attribute to <section>).  It is auto-generated from a tool that has
already been released.

A more unfortunate limitation ... I am limited to modules that can
easily be installed via "ppm".  Why?  I'd be happy to tell you.  But
it is pretty boring, and there is nothing I can do about it.  Don't
worry with this limitation for the moment.  Just know this could be
why I may not be able to use your favorite module.

Many thanks!

-George
         

_______________________________________________
Perl-XML mailing list
Perl-XMLlistserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
lookahead pruning with event-based parsers
user name
2006-04-04 19:46:07
On Tuesday 04 April 2006 21:44, Mock, George wrote:
> Perl-XML Experts,
>
> I'm parsing large XML files, so I have to use an event
> based parser.
> Sticking with SAX parsers for now.  I have found some really good
> articles on using SAX parsers.  They helped me write a filter that
> prunes one tag, and all its related data and sub-tags, from a large
> XML file.  Great!  Now I need to do something more sophisticated.
>
> Here is a simple XML fragment ...
>
>    <ofd>
>       <dwarf>
>          <section>
>              ... many tags later ...
>              <name>fred</name>
>              ...
>          </section>
>          <section>
>              ...
>              <name>wilma</name>
>              ...
>          </section>
>          <section>
>              ...
>              <name>pebbles</name>
>              ...
>          </section>
>          ...
>      </dwarf>
>    </ofd>
>
> I want to keep the section named "wilma"
> and prune the rest of them.
> These sections are very large, thus somehow buffering in memory is not
> an option.
>
> The decision to keep or prune a section cannot be made until the name
> is seen.  But by the time the name is seen, many parts of the section
> have already been processed.  It seems I need to somehow lookahead
> and know the name of section just as the <section> tag is being
> processed.
> I can't figure out how to do that.
>
> I'd appreciate any pointers to articles, tutorials, whatever, that do
> something similar with SAX parsers.
>
> One limitation is that I cannot change the XML (like add a name
> attribute to <section>).  It is auto-generated from a tool that has
> already been released.
>
> A more unfortunate limitation ... I am limited to modules that can
> easily be installed via "ppm".  Why?  I'd be happy to tell you.  But
> it is pretty boring, and there is nothing I can do about it.  Don't
> worry with this limitation for the moment.  Just know this could be
> why I may not be able to use your favorite module.

Hi Mr. Mock!

Well, the only idea I came up with is to process the XML twice using
SAX. First, keep a count of the sections and note the section number
in which you see the right <name> tag. Then start over and, once you
reach that section number, process it and keep the parts of it that
you need. With this solution, you don't need anything except SAX.
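
A minimal sketch of the two-pass idea above. It is written with Python's
standard xml.sax module rather than Perl's XML::SAX, so read it as an
illustration of the technique and not of any Perl API. The names
SectionFinder, SectionKeeper and keep_section are hypothetical. It assumes
that sections are not nested and that a <name> whose text equals the wanted
value identifies the section it sits in.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class SectionFinder(xml.sax.ContentHandler):
    """Pass 1: record the 0-based index of the section whose <name> matches."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.index = -1     # index of the section currently being read
        self.found = None   # index of the first matching section, once known
        self.text = None    # collects <name> text; None while outside <name>

    def startElement(self, name, attrs):
        if name == "section":
            self.index += 1
        elif name == "name":
            self.text = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.text is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "name" and self.text is not None:
            if self.found is None and "".join(self.text).strip() == self.wanted:
                self.found = self.index
            self.text = None


class SectionKeeper(xml.sax.ContentHandler):
    """Pass 2: forward all events except those of the unwanted sections."""

    def __init__(self, keep, out):
        super().__init__()
        self.keep = keep    # section index found in pass 1 (None keeps nothing)
        self.out = out      # downstream handler that writes the result
        self.index = -1
        self.pruning = False

    def startElement(self, name, attrs):
        if name == "section":
            self.index += 1
            self.pruning = self.index != self.keep
        if not self.pruning:
            self.out.startElement(name, attrs)

    def characters(self, content):
        if not self.pruning:
            self.out.characters(content)

    def endElement(self, name):
        if not self.pruning:
            self.out.endElement(name)
        if name == "section":
            self.pruning = False


def keep_section(xml_bytes, wanted):
    """Parse the same input twice: locate the section, then filter."""
    finder = SectionFinder(wanted)
    xml.sax.parseString(xml_bytes, finder)
    buf = io.StringIO()
    xml.sax.parseString(xml_bytes, SectionKeeper(finder.found, XMLGenerator(buf)))
    return buf.getvalue()
```

For a large file on disk, the same two handlers could be driven by two calls
to xml.sax.parse(path, handler) instead of parseString, so that nothing
beyond the current event is held in memory.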

Regards,

	Shlomi Fish

---------------------------------------------------------------------
Shlomi Fish      shlomifiglu.org.il
Homepage:        http://www.shlomifish.org/


95% of the programmers consider 95% of the code they did not write, in the
bottom 5%.
_______________________________________________
lookahead pruning with event-based parsers
user name
2006-04-04 19:56:28
Hi George,

* Mock, George <gmockti.com> [2006-04-04 21:15]:
> The decision to keep or prune a section cannot be made until
> the name is seen. But by the time the name is seen, many parts
> of the section have already been processed. It seems I need to
> somehow lookahead and know the name of section just as the
> <section> tag is being processed. I can't figure out how to do
> that.

you can’t figure it out, because it’s not possible. What you need
to do is start buffering events when you see a `start-tag` event
for `section`, and then decide whether to emit the buffered
events or discard them once you see the `end-tag` event for the
`name` element.

Like any non-trivial processing with SAX, your task requires a
basic state machine and a replayable event buffer. Such code is
tedious and hard to understand for complex requirements, but
that’s the price for using SAX.
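
A rough sketch of such a state machine with a replayable buffer. It uses
Python's standard xml.sax module instead of Perl's XML::SAX, purely to
illustrate the technique. The names BufferingPruner and prune are
hypothetical. It assumes sections are not nested and that the first <name>
inside a section is the one that decides its fate.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class BufferingPruner(xml.sax.ContentHandler):
    """Buffer each section's events until its first <name> has been read."""

    def __init__(self, wanted, out):
        super().__init__()
        self.wanted = wanted
        self.out = out            # downstream handler that writes the result
        self.state = "outside"    # outside | buffering | keeping | pruning
        self.buffer = []          # recorded (method name, args) pairs
        self.text = None          # collects the deciding <name> text

    def _event(self, method, *args):
        # Record, forward or drop an event depending on the current state.
        if self.state == "buffering":
            self.buffer.append((method, args))
        elif self.state != "pruning":
            getattr(self.out, method)(*args)

    def _decide(self, keep):
        # Replay the buffered events if the section is kept, else drop them.
        events, self.buffer, self.text = self.buffer, [], None
        self.state = "keeping" if keep else "pruning"
        if keep:
            for method, args in events:
                getattr(self.out, method)(*args)

    def startElement(self, name, attrs):
        if name == "section" and self.state == "outside":
            self.state = "buffering"
        elif name == "name" and self.state == "buffering":
            self.text = []
        self._event("startElement", name, attrs)

    def characters(self, content):
        if self.text is not None:
            self.text.append(content)
        self._event("characters", content)

    def endElement(self, name):
        self._event("endElement", name)
        if name == "name" and self.text is not None:
            self._decide("".join(self.text).strip() == self.wanted)
        elif name == "section":
            self.buffer = []      # a section that never had a <name> is dropped
            self.state = "outside"


def prune(xml_bytes, wanted):
    """Run the filter once over the input and return the pruned XML text."""
    buf = io.StringIO()
    xml.sax.parseString(xml_bytes, BufferingPruner(wanted, XMLGenerator(buf)))
    return buf.getvalue()
```

The buffer only holds the events between <section> and the first </name>, so
its size depends on how far into a section the name appears. Once the
decision is made, the rest of the section is streamed through or skipped
without further buffering.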

If you need to do much processing at that level, do have a look
at STX (<http://stx.sf.net/>), which has a Perl implementation
called XML::STX. This is a transformation language that operates
on SAX but looks a lot like XSLT. STX makes working with SAX much
easier because it keeps track of a minimal amount of context and
formalizes the notion of event buffers, so you can concentrate
on the actual transform instead of handling red tape.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>