Thread: RE: parsing html in pl/sql to "page" the html.
2008-09-26 14:45:29
Since you mention that the html pages have no consistency, is converting them to another format such as PDF an option? For example, you could convert them to PDF (using iText, http://www.lowagie.com/iText/) and then perhaps split the result into separate pages.
From: ml-errors fatcity.com [mailto:ml-errors fatcity.com] On Behalf Of David Wendelken
Sent: Friday, September 26, 2008 12:15 PM
To: Multiple recipients of list ODTUG-SQLPLUS-L
Subject: parsing html in pl/sql to "page" the html.
I've got an interesting task to do, and would like some advice.
I'll be receiving html text inside an nclob. The files are report output from 200 to 300 different reports, written by a variety of programmers over the years.
The html text is of unknown quality or consistency. It's safe, just of no known consistency.
Some of these html files can be very, very big, and the networks I'm deploying on are not fast.
Rewriting all those reports one-by-one is not an option for a host of reasons.
The idea is to parse the html into "pages" of a user-specified number of lines and only transmit the first page to the user to begin with, using a generic, one-size-fits-all procedure. Running a few lines over the user-specified size doesn't matter all that much, though it would be nice to be precise.
If the user wants to skip to the last page to get the totals, for example, we would only be transmitting 2 pages' worth of data instead of very many.
Each "page" would be stored in an nclob in a paging table, structured so I know which pages go with which.
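
The paging table described above could be sketched roughly as follows. This is only an illustration in Python, with SQLite standing in for the Oracle table and a TEXT column standing in for the nclob; the table name `report_page`, its columns, and the three helper functions are hypothetical and do not come from the post.

```python
import sqlite3

def store_pages(conn, report_id, pages):
    """Store each page body under (report_id, page_no), numbering from 1."""
    # Hypothetical layout: one row per page, keyed so pages of a report stay together.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS report_page ("
        " report_id INTEGER NOT NULL,"
        " page_no INTEGER NOT NULL,"
        " body TEXT NOT NULL,"
        " PRIMARY KEY (report_id, page_no))"
    )
    conn.executemany(
        "INSERT INTO report_page (report_id, page_no, body) VALUES (?, ?, ?)",
        [(report_id, n, body) for n, body in enumerate(pages, start=1)],
    )

def fetch_page(conn, report_id, page_no):
    """Return one page body, or None if that page does not exist."""
    row = conn.execute(
        "SELECT body FROM report_page WHERE report_id = ? AND page_no = ?",
        (report_id, page_no),
    ).fetchone()
    return row[0] if row else None

def last_page_no(conn, report_id):
    """Return the highest page number stored for a report (None if no pages)."""
    return conn.execute(
        "SELECT MAX(page_no) FROM report_page WHERE report_id = ?",
        (report_id,),
    ).fetchone()[0]
```

With a layout like this, jumping to the totals means one lookup of the highest page number and one fetch of that page, so only the first and last pages ever cross the network.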
If someone has a better idea, I'm open to it.
Short of a better approach, my idea is to parse the html for the presence of certain tags such as <hr/>, </tr>, </p>, etc., because they represent "line feeds" on a printed page.
Naturally, I will have to account for variations in capitalization, whitespace, and attributes.
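
One way to tolerate those variations is a single case-insensitive regular expression. The sketch below is in Python purely for illustration (the post itself targets PL/SQL or a C external procedure). The name `count_lines` is hypothetical, and the choice of exactly these three tags is an assumption taken from the examples above. The pattern will miscount if an attribute value contains a `>` character, and it does not skip tags inside comments or scripts.

```python
import re

# Tags assumed to end one visual line: <hr> in any form, plus closing </tr> and </p>.
LINE_BREAK = re.compile(
    r"<\s*hr\b[^>]*>"            # <hr>, <HR />, <hr size="2"/>
    r"|<\s*/\s*(?:tr|p)\s*>",    # </tr>, </TR >, </P>
    re.IGNORECASE,
)

def count_lines(html):
    """Count the tags that are treated as line feeds in this sketch."""
    return sum(1 for _ in LINE_BREAK.finditer(html))
```

For example, `count_lines('<TR ><TD>a</TD></TR ><p class="x">b</P><Hr noshade />')` counts three line feeds: the `</TR >`, the `</P>`, and the `<Hr noshade />`. The opening `<TR >` and `<p class="x">` are ignored.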
Plus, of course, I have to account for tags that are "missing" from a page because they sit on a previous page, such as the start of a table, or on the next page, such as the end of a table.
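
A minimal sketch of that bookkeeping, again in Python for illustration only: keep a stack of the tags that are open at each page break, close them at the end of the page, and reopen them (with their original attributes) at the top of the next one. The function name `paginate` and the tag sets are assumptions, not anything from the post. The sketch also assumes no `>` inside attribute values and does not treat comments or scripts specially.

```python
import re

# Group 1: "/" for a closing tag. Group 2: tag name. Group 3: attribute text.
TAG = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)\b([^>]*)>")

LINE_END_CLOSERS = {"tr", "p"}   # assumed: these closing tags end a visual line
LINE_END_VOID = {"hr"}           # assumed: this tag is a line by itself
VOID = {"hr", "br", "img", "input", "meta", "link"}  # tags that never stay open

def paginate(html, lines_per_page):
    """Split html into pages of about lines_per_page lines, repairing open tags."""
    pages = []
    open_tags = []    # stack of (lowercased name, original opening tag text)
    prefix = ""       # opening tags replayed at the top of the current page
    page_start = 0
    lines = 0
    for m in TAG.finditer(html):
        closing, name, attrs = m.group(1), m.group(2).lower(), m.group(3)
        line_ended = False
        if closing:
            # Pop back to the nearest matching opener, dropping unclosed inner tags.
            for i in range(len(open_tags) - 1, -1, -1):
                if open_tags[i][0] == name:
                    del open_tags[i:]
                    break
            line_ended = name in LINE_END_CLOSERS
        elif name in VOID or attrs.rstrip().endswith("/"):
            line_ended = name in LINE_END_VOID
        else:
            open_tags.append((name, m.group(0)))
        if line_ended:
            lines += 1
            if lines >= lines_per_page:
                # Close whatever is still open, innermost first, to end the page.
                suffix = "".join("</%s>" % n for n, _ in reversed(open_tags))
                pages.append(prefix + html[page_start:m.end()] + suffix)
                # The next page starts by reopening the same tags, outermost first.
                prefix = "".join(text for _, text in open_tags)
                page_start = m.end()
                lines = 0
    rest = html[page_start:]
    if rest.strip():
        pages.append(prefix + rest)
    return pages
```

Because the scan is a single pass over the text with one regular expression, the same logic should translate fairly directly to a C external procedure if speed becomes a problem.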
And, of course, this has to run fast or it totally defeats the purpose.
As a background note, in case it makes a difference, the data is received in xml format and passed to xslt templates for formatting into html.
It's possible that I might move this trick forward to the original xml file instead, but I suspect I would have to pass an xml tag (or tags) that would represent a line to the generic program.
If need be, I can do an external procedure in C to do this. Java in the database isn't an option, as I'll be deploying on Oracle XE, which doesn't support that.
Any suggestions?
Architectural approach? Tools/utilities available on the market? Gotchas to watch out for?
Thanks
David Wendelken
Dulcian, Inc.