I am conducting web research and think that Nutch will be a useful tool
to aid my quest for information. I am interested in performing a large
crawl (100 million+ pages) and analyzing the contents of these pages,
including building a link graph. I have figured out how to get a large
list of pages with fetch, then bootstrap the to-crawl list and re-crawl,
as per http://lucene.apache.org/nutch/tutorial.html.
If this isn't the best way to perform a large crawl, please provide
suggestions. I don't know if Nutch has any tools for building a web
graph, but I would have no trouble building it on my own if I knew how
to access the pages' contents. Unfortunately, I have no idea how to do
this. Once pages are fetched, how does one view the HTML data?