
Thread: parsing html in pl/sql to "page" the html.




2008-09-26 14:15:29

 

I've got an interesting task to do, and would like some advice.

 

I'll be receiving html text inside an nclob.   The files are report output from 200 to 300 different reports, written by a variety of programmers over the years.

 

The html text is of unknown quality or consistency.  It's safe, just of no known consistency.

 

Some of these html files can be very, very big, and the networks I'm deploying on are not fast.

 

Rewriting all those reports one-by-one is not an option for a host of reasons.

 

The idea is to parse the html into "pages" of a user-specified number of lines and transmit only the first page to the user to begin with, using a generic, one-size-fits-all procedure.  Going a few lines over the user's page size doesn't matter all that much, though it would be nice to be precise.

 

If the user wants to skip to the last page to get the totals, for example, we would only be transmitting 2 pages worth of data instead of very many.

 

Each "page" would be stored in an nclob in a paging table, structured so I know which pages go with which.

 

If someone has a better idea, I'm open to it.

 

Short of a better approach, my idea is to parse the html for the presence of certain tags such as <hr/>, </tr>, </p>, etc., because they represent "line feeds" on a printed page.

Naturally, I will have to account for variations in capitalization, whitespace, and attributes.
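
To make the tag-matching idea concrete, here is a minimal sketch. It is written in Python purely for illustration; the real thing would be PL/SQL or a C external procedure. The names BREAK_RE and split_into_pages are made up for this example, and treating <br> as a line break is my own assumption on top of the tags listed above. It does not cope with a ">" inside a quoted attribute value.

```python
import re

# Tags treated as "line feeds". <hr> and <br> count where they appear
# (<br> is an assumption, not one of the tags named in the post);
# <tr> and <p> count on their closing tag.
# The pattern tolerates mixed case, extra whitespace, and attributes,
# e.g. <HR />, <hr class="x">, </tr >, </P>.
BREAK_RE = re.compile(
    r"<\s*(?:(?:hr|br)\b[^>]*|/\s*(?:tr|p)\s*)>",
    re.IGNORECASE,
)


def split_into_pages(html, lines_per_page):
    """Cut html right after every lines_per_page-th break tag."""
    pages = []
    start = 0
    count = 0
    for match in BREAK_RE.finditer(html):
        count += 1
        if count == lines_per_page:
            pages.append(html[start:match.end()])
            start = match.end()
            count = 0
    # Whatever is left after the last full page becomes the final page.
    if start < len(html):
        pages.append(html[start:])
    return pages


if __name__ == "__main__":
    sample = "<p>a</p><P>b</P ><HR /><p>c</p>"
    for number, page in enumerate(split_into_pages(sample, 2), start=1):
        print(number, page)
```

Joining the pages back together gives the original text unchanged, so this step only decides where to cut. Whether each piece is valid html on its own is a separate problem.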

 

Plus, of course, I have to account for tags that are "missing" from a page because they sit on a neighboring one, such as the start of a table on a previous page or the end of a table on the next page.
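
One way I can picture handling the missing tags is to keep a stack of whatever is still open at each cut, close those tags at the end of the page, and reopen them at the top of the next one. Again a rough Python sketch for illustration only; TAG_RE, open_tags_at_end and repair_pages are hypothetical names. It assumes the html is at least roughly well nested, which I can't actually guarantee for these reports, and it would be fooled by tags inside comments or scripts.

```python
import re

# Tags that never take a closing tag, so they are never tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Group 1: "/" for a closing tag; group 2: tag name; group 3: attributes.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")


def open_tags_at_end(fragment, carried=()):
    """Return the stack of (name, opening tag text) still open after fragment."""
    stack = list(carried)
    for m in TAG_RE.finditer(fragment):
        closing, name, attrs = m.group(1), m.group(2).lower(), m.group(3)
        if name in VOID_TAGS or attrs.rstrip().endswith("/"):
            continue  # void or self-closed tag: nothing to track
        if not closing:
            stack.append((name, m.group(0)))
        else:
            # Pop back to the nearest matching opener; ignore stray closers.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == name:
                    del stack[i:]
                    break
    return stack


def repair_pages(raw_pages):
    """Reopen carried-over tags at the top of each page and close them at the end."""
    repaired = []
    carried = []
    for raw in raw_pages:
        prefix = "".join(text for _, text in carried)
        carried = open_tags_at_end(raw, carried)
        suffix = "".join("</%s>" % name for name, _ in reversed(carried))
        repaired.append(prefix + raw + suffix)
    return repaired


if __name__ == "__main__":
    raw = ['<table border="1"><tr><td>a</td></tr>',
           '<tr><td>b</td></tr></table><p>done</p>']
    for page in repair_pages(raw):
        print(page)
```

Reusing the original opening tag text keeps attributes such as border or class on the continuation page. Repeating a table's header row on every page would need extra handling that this sketch doesn't attempt.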

 

And, of course, this has to run fast or it totally defeats the purpose.

 

As a background note, in case it makes a difference, the data is received in xml format and passed to xslt templates for formatting into html.

 

It's possible that I might move this trick forward to the original xml file instead, but I suspect I would have to pass an xml tag (or tags) that would represent a line to the generic program.

 

If need be, I can do an external procedure in C to do this.  Java in the database isn't an option, as I'll be deploying on Oracle XE, which doesn't support that.

 

Any suggestions? 

 

Architectural approach?  Tools/utilities available on the market?  Gotchas to watch out for?

 

Thanks

 

David Wendelken

Dulcian, Inc.

 
