Laurence Finston wrote:
>Last night, I had to send off my message in a hurry. Here are a couple of
>additional comments.
>
>On Wed, 14 Feb 2007, Timothy Murphy wrote:
>
>
>
>>On Wednesday 14 February 2007 12:43, lfinsto1 gwdg.de wrote:
>>
>>Over the years I have grown more and more ashamed of this system
>>(accessible I think at <http://www.maths.tcd.ie/local/library/>),
>>and long ago decided it was time for a change.
>>
>>
>
>I don't think there's any need to be ashamed of a program that has
>worked well for 20 years. I've just looked up `refer' and found that,
>on a GNU/Linux system, it's part of the `groff' package. Apparently,
>it implements a simple database in the form of a text file, and the
>manual page uses the term "database".
>
>
>
>>At present our secretaries enter new books "by hand",
>>typing in author, title, etc.
>>It seems that this could be greatly simplified by a program
>>in which the secretary simply typed in the ISBN number,
>>and which then accessed the Library of Congress database,
>>and stored the entry, probably in XML format.
>>
>>
>
>Retrieving the XML data is a piece of cake. Apparently, YAZ has a way
>of doing this, but I've only used YAZ so far for retrieving Pica data
>from a Z39.50 server. To get the XML data from an OAI server,
>I used a library function `get_http' (or something) under Windows,
>and am now using GNU Wget under GNU/Linux.
>
>The usual way of approaching the problem from this point is to parse
>the XML data and store the information in a data structure, probably some
>kind of tree. This is the tricky part, and using `libxml',
>some other library, or any of the many tools available for processing
>XML doesn't seem to reduce the amount of work one has to do significantly,
>no matter what approach one chooses. This is just my impression, and I'd
>be interested to hear what other programmers' opinions are. However,
>once the data is stored in the data structure, writing it to a database
>or formatting it in various ways is reasonably straightforward.
>It also makes it possible to do much more complicated things with the data.
>It might be possible to write a script or a program that can recognize
>some tags and perform simple transformations, or put together a pipeline
>of utilities to do this, as outlined by another poster. If this would be
>adequate for your needs, great. However, my approach would be to parse the
>data and store it in a data structure for the sake of the additional
>functionality one could implement, once that's done.
>
>
Guys,
I'll offer up my opinion about the XML parsing issue as a fellow
programmer. We started using XML-like data structures -- more inspired
by SGML -- long before the current family of XML tools came of age, so
we have, over the years, developed a great many different ad hoc
approaches to dealing with XML, from simple text-based pattern matching
to more elaborate parsing code. However, I think everybody at Index Data
has come to the conclusion that the benefits of using standard packages
for this far outweigh the problems, and that with libxml (and its
wrappers in various other languages), the problems are few.
Some of the benefits you get from using libxml include:
1) It is blindingly fast, and very robust, in our experience. In fact,
it is fast enough that in future versions of our Zebra indexer, we will
be recommending an XML/XSLT-based approach to designing indexing rules
as the best overall approach. We've found that this code is actually
faster, and much, much more functional than the homegrown indexing
mechanism we had originally developed.
2) Because it is a formal implementation of the whole standard, it will
reliably deal with all the different things that might go into an XML
document, like namespaces and processing instructions. If it fails, you can
generally be pretty sure it's because the document is not well formed,
and you can go hassle whoever sent you the document about it. Going a
step further and validating documents against application-specific
schemas gives you even more control. We have found that using a full,
conformant parser, and making XML the lingua franca for communicating
with customers, saves us tons of time spent in parsing people's homegrown
'almost-XML' formats.
3) Once you're using libxml, if a document isn't to your liking, it's
a simple process to throw your XML tree through an XSLT
transformation. XSLT, for people who haven't played with it, is
definitely worth getting to know -- it is a powerful, flexible language
for transforming XML documents, and a pleasure to use.
4) Finally, if you happen to be a C programmer, I have been really
delighted with the 'tree' API in libxml. I find it more intuitive and
pleasant to use than many DOM-inspired APIs found in other languages
(see http://xmlsoft.org/example.html).
It does take an effort to get to know it, but, having developed several
XML-ish parsers myself, I can say that learning the libxml API is
definitely easier and faster, and well worth the effort for all of the
fringe benefits you get.
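
As a rough illustration of points 2 and 4 (a conformant parser rejecting
malformed input, and walking the resulting tree), here is a minimal sketch
in Python. It uses the standard library's xml.etree.ElementTree rather than
libxml itself, purely so the example has no dependencies; the record layout,
the element names, and the parse_record helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliographic record; the element names are invented for
# illustration and do not follow any particular schema.
RECORD = """<record>
  <title>An Example Title</title>
  <author>A. N. Author</author>
  <author>Another Author</author>
  <isbn>0-000-00000-0</isbn>
</record>"""


def parse_record(xml_text):
    """Parse xml_text into a tree and collect its fields into a dict.

    Returns None if the document is not well formed.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # A conformant parser rejects malformed input outright.
        return None
    fields = {}
    for child in root:
        # Repeated elements (e.g. several authors) accumulate in a list.
        fields.setdefault(child.tag, []).append((child.text or "").strip())
    return fields


good = parse_record(RECORD)
bad = parse_record("<record><title>Unclosed</record>")  # mismatched tags
print(good)
print(bad)
```

Once the fields are in an ordinary data structure like this, writing them
to a database or reformatting them is a separate and much simpler step.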
Hope this is useful,
--Sebastian
>It takes a certain amount of effort to learn to use a database package,
>but I believe it's worth the effort. Once one has learned how, one
>will probably think of all sorts of uses for it. I haven't looked into this
>thoroughly, but I'm fairly sure that the simple one used with `refer' must
>be quite limited in comparison with, say, `nosql' (which I admittedly
>haven't started using yet). On the other hand, I do believe in the principles
>"If it ain't broke, don't fix it" and "Don't use a cannon to shoot at
>sparrows".
>
>
>Laurence
>
>_______________________________________________
>Yazlist mailing list
>Yazlist lists.indexdata.dk
>http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist
>
>
>
>
--
Sebastian Hammer, Index Data
quinn indexdata.com www.indexdata.com
Ph: (603) 209-6853 Fax: (866) 383-4485