Jonathan Leybovich wrote:
>Hi Sebastian,
>
>I'm curious where the meta/bibliographic data is
>coming from for some of these open content projects.
>Project Gutenberg seems to keep relatively structured
>catalog data for its contents, but I'm wondering where
>anything other than title would come from for
>something like a Wikipedia article.
>
>
You have to do some detective work and possibly quite a bit of mangling
for each source. It's been a fun exercise that has led to a bug report
or two asking for new functionality in Zebra.

Gutenberg is probably the easiest; they've become *much* better at
structured metadata than they used to be. Wikipedia has a file
containing titles and abstracts, which is what we use. The default
search actually hits title only, which I think is kind of appropriate
for an encyclopaedia and cuts down on the noise. There's a (much) larger
file containing the entire content, and it might be tempting to mine
that for something resembling subject categories one day, but it's
messier data. Doing full-text indexing and data mining on that would be
extremely interesting, but it's a project for another day. Dmoz is
downloadable in RDF form, and the Internet Archive data comes from
spidering (with permission) XML metadata and sometimes MARC records. In
each case, we try to keep the sources up to date on a weekly basis. For
OAIster, we're still working out the details of the relationship.
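
As a rough illustration of the title-only search described above, the
sketch below builds an SRU 1.1 searchRetrieve URL with a CQL title
query. The base URL is a placeholder (the real target addresses are not
given in this message), and the use of the dc.title index is an
assumption about what the target supports; build_sru_url is a
hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder address: the actual Z/SRU target URL is not stated in this thread.
BASE_URL = "http://sru.example.org/wikipedia"


def build_sru_url(title_terms, max_records=10, base_url=BASE_URL):
    """Build an SRU 1.1 searchRetrieve URL for a title-only CQL query.

    Assumes the target exposes a dc.title index; check the target's
    explain record before relying on it.
    """
    # Escape backslashes and double quotes so the CQL quoted string stays valid.
    escaped = title_terms.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": f'dc.title = "{escaped}"',
        "maximumRecords": str(max_records),
    }
    # urlencode percent-encodes the query so it is safe to place in a URL.
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_sru_url("information retrieval"))
```

Restricting the query to a title index rather than searching all fields
mirrors the noise-reduction choice mentioned for the Wikipedia data.
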

Now if only Google would make structured metadata available... but
where do you stick an ad in a DublinCore file?

I've set up a mailing list for the service, linked to from the
opencontent page on our site. If anyone has suggestions for ways the
indexing or result presentation could be made better or more useful,
I'd love to hear them.

Cheers,
--Sebastian
>
>
>>Date: Fri, 02 Mar 2007 09:12:34 -0500
>>From: Sebastian Hammer <quinn indexdata.com>
>>Subject: [Yazlist] Open Content and SRU/Z39.50
>>To: yazlist lists.indexdata.com
>>Message-ID: <45E830D2.1090108 indexdata.com>
>>Content-Type: text/plain; charset=ISO-8859-1;
>>format=flowed
>>
>>Hi guys,
>>
>>this is a follow-up to an earlier announcement on this list. We've now
>>completed the initial setup of our Z/SRU targets for several open
>>content sites, specifically the Open Content Alliance, Wikipedia, DMOZ,
>>and Project Gutenberg. Our hope is that exposing these resources through
>>open information retrieval protocols will allow libraries and others to
>>more easily integrate them into applications, portals, and internet sites.
>>
>>More details are available at http://www.indexdata.dk/opencontent/ .
>>
>>
>>
>_______________________________________________
>Yazlist mailing list
>Yazlist lists.indexdata.dk
>http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist
>
--
Sebastian Hammer, Index Data
quinn indexdata.com www.indexdata.com
Ph: (603) 209-6853 Fax: (866) 383-4485