There is another aspect to the Google approach to large-scale
text capture which calls for comment. Google has so far shown
little sensitivity to the requirement to provide consistent and
reliable meta-data. There is now more information available from
Google on the books it has scanned (click on the 'About this
book' link). Indeed some of it seems rather promising (e.g. the
list of 'key words and phrases', though I don't see an
explanation of how these phrases were selected; presumably by an
algorithm). Here is a typical instance:
http://books.google.com/books?vid=OCLC61913221&id=cD0JAAAAIAAJ&dq=clarendon
And for the public domain books, or some of them, these keywords
are themselves hyper-linked to occurrences throughout the text.
A very nice and typically Google-ish touch.
But in spite of some recent improvements there is overall a lack
of explicitness and of satisfactory meta-data from Google Book
Search. There is still no reliable way of finding out
systematically which books have been incorporated into Google
Book Search and when they were incorporated, and it would be
useful to be able to find out from which copy of an early book
the Google scan was made (this information could easily be
incorporated where library copies have been used).
I haven't been following the other large-scale digitization
projects at all closely, but I think it's very likely that the
best projects will have very open and explicit bibliographic
meta-data, and Google should behave in a more explicit and open
way if its project is to carry on.
BTW if you doubt that the Google project has some low-grade
scans, take a look at this page (which is proffered by Google as
the first of three typical Selected Pages for the book
concerned):
http://books.google.com/books?vid=OCLC09756236&id=CrKWwZ4d29AC&pg=RA42-PA1&dq=thomas+hodgkin
One wonders whose fat fingers those were, that have been scanned
in place of valuable 19th C text?
These large-scale digitization projects have enormous potential.
I hope that the librarians ensure that the ones that work are
the ones with good catalog data and superior meta-data. That
criterion should be added to any list of desiderata, and it
bears on all the aspects mentioned by Joe Esposito.
adam
On 12/25/06, Joseph J. Esposito <espositoj gmail.com> wrote:
> Ongoing discussions about various mass digitization projects,
> driven primarily by the Google Libraries program but including
> the respective activities of Microsoft, the Open Content
> Alliance, and others, prompts these comments about what should
> be taken into account as these programs proceed. My concern is
> a practical one: Some projects are incomplete in their design,
> which will likely result in their having to be redone in the
> near future, an expense that the world of scholarly
> communications can ill afford. There are at least four
> essential characteristics of any such project, and there may
> very well be more.
>
> As many have noted, the first requirement of such a project is
> that it adopt an archival approach. Some scanning is now being
> done with little regard for preserving the entire informational
> context of the original. Scanning first editions of Dickens
> gives us nothing if the scans do not precisely copy first
> editions of Dickens; the corollary to this is that clearly
> articulated policies about archiving must be part of any mass
> digitization project. Some commercial projects have little
> regard for this, as archival quality simply is not part of the
> business plan; only members of the library community are in a
> position to assert the importance of this. An archival
> certification board is evolving as a scholarly desideratum.
>
> Archives of digital facsimiles are important, but we also need
> readers' editions, the second requirement of mass digitization
> projects. This goes beyond scanning and involves the editorial
> process that is usually associated with the publishing
> industry. The point is not simply to preserve the cultural
> legacy but to make it more available to scholars, students, and
> interested laypeople. The high school student who first
> encounters Dickens's "Great Expectations" should not also be
> asked to fight with Victorian typography, not to mention
> orthography. In the absence of readers' editions, broad public
> support for mass digitization projects will be difficult to
> come by.
>
> As devotees of "Web 2.0" insist with increasing frequency, all
> documents are in some sense community documents. Thus scanned
> and edited material must be placed into a technical environment
> that enables ongoing annotation and commentary. The
> supplemental commentary may in time be of greater importance
> than the initial or "founding" document itself, and some
> comments may themselves become seminal. I become uneasy,
> however, when the third requirement of community engagement is
> not paired with the first of archival fidelity. What do we
> gain when "The Declaration of Independence" is mounted on a Web
> site as a wiki? Sitting beneath the fascinating activities of
> an intellectually engaged community must be the curated
> archival foundation.
>
> The fourth requirement is that mass digitization projects
> should yield file structures and tools that allow for machine
> process to work with the content. Whether this is called
> "pattern recognition" or "data mining" or something else is not
> important. What is important is to recognize that the world of
> research increasingly will be populated by robots, a term that
> no longer can or should carry a negative connotation. Some
> people call this "Web 3.0", but I prefer to think of it as "the
> post-human Internet," which may not even be a World Wide Web
> application.
>
> To my knowledge, none of the current mass digitization projects
> fully incorporate all four of these requirements.
>
> Note that I am not including any mention of copyright here,
> which is the topic that gets the most attention when mass
> digitization is contemplated. All four of these requirements
> hold for public domain documents. Copyright is a red herring.
>
> Joe Esposito