Thread: Data- and text-mining licensing

2006-05-30 21:50:00
Data/text mining has been an area of scientific study since the
1990s. For an overview, see: Losiewicz, Paul; Oard, Douglas W.;
Kostoff, Donald N. "Science and Technology Text Mining Basic
Concepts" (ADA415886). January 1, 2003. 28 pages.
Handle: http://handle.dtic.mil/100.2/ADA415886

To my mind, it is a relational "fact-finding" and extraction
expedition. I don't see it as any different from a traditional
human-initiated search result displayed with visualization tools.
Some database providers offer this service to customers now; see
Thomson Scientific:
http://scientific.thomson.com/press/2005/8298419/

Data/text mining also has application in the information commons.
For a discussion of the not-too-distant future, see Judy Hilden's
article, "Will the Future Bring Even More Important Copyright
Issues Than The Ones Raised by Online File-Swapping?" in FindLaw
Writ, 24 May 2005:
http://writ.news.findlaw.com/hilden/20050524.html

  "The issues are as simple and fundamental as they are troubling:
  Exactly how much content may be copied on the Internet -- and of
  what kind -- before copyright is infringed? And more deeply, when
  is content "copied" in the first place when it comes to the
  Internet? Does the fact that the copying is done via a machine
  editor -- not a human editor -- make a difference?"

Bonnie Klein

-----Original Message-----
[mailto:owner-liblicense-llists.yale.edu] On Behalf Of Joseph J. Esposito
Sent: Monday, May 29, 2006 4:57 PM
To: liblicense-llists.yale.edu
Subject: Data- and text-mining licensing

I have been involved in a number of discussions concerning data and
text mining recently and wonder if anyone has any experience with
these topics that they would like to share. The basic question is
whether the license for an electronic resource in a form suitable to
be read by humans extends as well to a license for machine-reading.

The area of data and text mining for scholarly materials is a new
one, at least to me. My understanding is that materials (research
data, user data, published articles, books, etc.) can be gathered
together in such a way as to enable robots to sift through them and
identify patterns and themes. These new patterns--effectively
robot-generated discoveries--may include things that are not present
in any single document in the collection. Thus, the collection is
greater than the sum of its parts, but that greater value is only
perceptible by machines. This past week I heard an excellent
presentation (it is not yet online, but when the link becomes
available, I will post it) by a biostatistician, who commented that
human access to such databases is "of low value," in contrast to the
"higher value of robot access."

Data and text mining are sometimes being discussed in the context of
the idea of "Web 2.0," but I think this is a mistake. Web 2.0 is a
concept of Tim O'Reilly's to describe the emerging practices on the
Internet today in the areas of community-building and user-generated
content. Web 2.0 is a metaphor, not a technical specification--but a
very valuable metaphor. O'Reilly, for example, distinguishes between
the early Web (his 1.0) and the evolving Web by contrasting
Encyclopaedia Britannica and the Wikipedia. Both 1.0 and 2.0,
however, share the fact that the users are humans. Data mining is a
game for machines. It would be inaccurate to call it "Web 3.0"
because machines don't require a Web interface at all. Web 2.0 is
post-modern, but data-mining is post-human. Today's neologism: the
Post Human Internet, or PHUNET for short, pronounced either FOO-net
or (my preference) PEE-YOU-net. See Charles Stross's novel
Accelerando.

Whether or not database mining of this kind will yield the kind of
new insights some believe it will, I do not know, but it would be
useful for the rights situation to be clarified early on to fend off
litigation at a later time. It seems likely to me that publishers
will begin to separate human- and machine-readable rights, just as
they distinguish between subscriptions for libraries and
individuals. There is an interesting precedent put forward by some
members of the library community, who argue that it is reasonable
for publishers to charge for hardcopy, but electronic materials
should be free. It is conceivable that over time the "low value" of
human-readable rights will become Open Access, leaving the higher
value PHUNET rights for aggressive economic exploitation. It boggles
the mind to think what a large collection of science articles could
be worth some day.

Joe Esposito
