Hi,
Maybe some people will find this posting interesting.
Web spam is one of the biggest issues for Nutch in whole-web crawls,
from my POV.
Greetings,
Stefan
>
> During AIRWeb'06 we announced the availability of the collection.
>
> We are currently planning a Web Spam challenge based on the dataset we
> have built. I assume most of you will be interested in this, so I have
> moved the "webspam-volunteers" list to "webspam-announces". If you do
> not want to be in this new "webspam-announces" list, please send me an
> e-mail.
>
> This was shown during AIRWeb in Seattle:
>
> .............................................................
>
> Web Spam Collection Available
> August 10th, 2006
>
> We are pleased to announce the availability of a public collection for
> research on Web spam. This collection is the result of efforts by a
> team of volunteers:
>
>   Thiago Alves      Antonio Gulli             Tamas Sarlos
>   Luca Becchetti    Zoltan Gyongyi            Mike Thelwall
>   Paolo Boldi       Thomas Lavergne           Belle Tseng
>   Paul Chirita      Alex Ntoulas              Tanguy Urvoy
>   Mirel Cosulschi   Josiane-Xavier Parreira   Wenzhong Zhao
>   Brian Davison     Xiaoguang Qi
>   Pascal Filoche    Massimo Santini
>
> The corpus is a large set of Web pages in 11,000 .uk hosts
> downloaded in May 2006 by the Laboratory of Web Algorithmics,
> Università degli Studi di Milano. The labelling process was
> coordinated by Carlos Castillo, working at the Algorithmic Engineering
> group at Università di Roma "La Sapienza". The project was funded
> by the DELIS project (Dynamically Evolving, Large Scale Information
> Systems).
>
> Volunteers were provided with a set of guidelines and were asked to
> mark a set of hosts as either normal, spam, or borderline. The
> collection includes about 6,700 judgments done by the volunteers and
> can be used for testing link-based and content-based Web spam
> detection and demotion techniques.
>
> More information is available on our Web page, including the
> guidelines given to the human judges, the instructions for obtaining
> the links and contents of the pages in this collection, and the
> contact information for questions and comments.
>
> http://aeserver.dis.uniroma1.it/webspam/
>
> If you use this data set please subscribe to our mailing list by
> sending an e-mail to webspam-announces-subscribe@yahoogroups.com.
>
> --
> Carlos Castillo
> Universita di Roma "La Sapienza"
> Rome, ITALY