Thread: Requirements for mass digitization projects




2006-12-29 03:10:05
There is another aspect to the Google approach to large-scale text
capture which calls for comment. Google has so far shown little
sensitivity to the requirement to provide consistent and reliable
meta-data. There is now more information available from Google on the
books it has scanned (click on the 'About this book' link). Indeed some
of it seems rather promising (e.g. the list of 'key words and phrases',
though I don't see an explanation of how these phrases were selected;
presumably by an algorithm). Here is a typical instance:

http://books.google.com/books?vid=OCLC61913221&id=cD0JAAAAIAAJ&dq=clarendon

And for the public domain books, or some of them, these keywords are
themselves hyper-linked to occurrences throughout the text. A very nice
and typically Google-ish touch.

But in spite of some recent improvements there is overall a lack of
explicitness and of satisfactory meta-data from Google Book Search.
There is still no reliable way of finding out systematically which
books have been incorporated into Google Book Search and when they
were incorporated. It would also be useful to be able to find out from
which copy of an early book the Google scan was made (this information
could easily be incorporated where library copies have been used).

I haven't been following the other large-scale digitization projects
at all closely, but I think it's very likely that the best projects
will have very open and explicit bibliographic meta-data, and Google
should behave in a more explicit and open way if its project is to
carry on.

BTW if you doubt that the Google project has some low-grade scans,
take a look at this page (which is proffered by Google as the first of
three typical Selected Pages for the book concerned):

http://books.google.com/books?vid=OCLC09756236&id=CrKWwZ4d29AC&pg=RA42-PA1&dq=thomas+hodgkin

One wonders whose fat fingers those were, that have been scanned in
place of valuable 19th C text?

These large-scale digitization projects have enormous potential. I
hope that the librarians ensure that the ones that work are the ones
with good catalog data and superior meta-data. That criterion should
be added to any list of desiderata, and it bears on all the aspects
mentioned by Joe Esposito.

adam


On 12/25/06, Joseph J. Esposito <espositojgmail.com> wrote:

> Ongoing discussions about various mass digitization projects,
> driven primarily by the Google Libraries program but including
> the respective activities of Microsoft, the Open Content
> Alliance, and others, prompt these comments about what should
> be taken into account as these programs proceed.  My concern is
> a practical one:  Some projects are incomplete in their design,
> which will likely result in their having to be redone in the
> near future, an expense that the world of scholarly
> communications can ill afford.  There are at least four
> essential characteristics of any such project, and there may
> very well be more.
>
> As many have noted, the first requirement of such a project is
> that it adopt an archival approach.  Some scanning is now being
> done with little regard for preserving the entire informational
> context of the original.  Scanning first editions of Dickens
> gives us nothing if the scans do not precisely copy first
> editions of Dickens; the corollary to this is that clearly
> articulated policies about archiving must be part of any mass
> digitization project.  Some commercial projects have little
> regard for this, as archival quality simply is not part of the
> business plan; only members of the library community are in a
> position to assert the importance of this.  An archival
> certification board is evolving as a scholarly desideratum.
>
> Archives of digital facsimiles are important, but we also need
> readers' editions, the second requirement of mass digitization
> projects.  This goes beyond scanning and involves the editorial
> process that is usually associated with the publishing
> industry. The point is not simply to preserve the cultural
> legacy but to make it more available to scholars, students, and
> interested laypeople.  The high school student who first
> encounters Dickens's "Great Expectations" should not also be
> asked to fight with Victorian typography, not to mention
> orthography.  In the absence of readers' editions, broad public
> support for mass digitization projects will be difficult to
> come by.
>
> As devotees of "Web 2.0" insist with increasing frequency, all
> documents are in some sense community documents.  Thus scanned
> and edited material must be placed into a technical environment
> that enables ongoing annotation and commentary.  The
> supplemental commentary may in time be of greater importance
> than the initial or "founding" document itself, and some
> comments may themselves become seminal.  I become uneasy,
> however, when the third requirement of community engagement is
> not paired with the first of archival fidelity.  What do we
> gain when "The Declaration of Independence" is mounted on a Web
> site as a wiki?  Sitting beneath the fascinating activities of
> an intellectually engaged community must be the curated
> archival foundation.
>
> The fourth requirement is that mass digitization projects
> should yield file structures and tools that allow for machine
> processes to work with the content.  Whether this is called
> "pattern recognition" or "data mining" or something else is not
> important. What is important is to recognize that the world of
> research increasingly will be populated by robots, a term that
> no longer can or should carry a negative connotation.  Some
> people call this "Web 3.0", but I prefer to think of it as "the
> post-human Internet," which may not even be a World Wide Web
> application.
>
> To my knowledge, none of the current mass digitization projects
> fully incorporate all four of these requirements.
>
> Note that I am not including any mention of copyright here,
> which is the topic that gets the most attention when mass
> digitization is contemplated. All four of these requirements
> hold for public domain documents.  Copyright is a red herring.
>
> Joe Esposito
