Thread: help for a nutch beginner




help for a nutch beginner
user name
2007-11-06 09:06:42
I am conducting web research and think that Nutch will be a useful tool
to aid my quest for information. I am interested in performing a large
crawl (100 million+ pages) and analyzing the contents of these pages,
including building a link graph. I have figured out how to get a large
list of pages with fetch, then bootstrap the to-crawl list and re-crawl,
as per http://lucene.apache.org/nutch/tutorial.html.
If this isn't the best way to perform a large crawl, please provide
suggestions. I don't know if Nutch has any tools for building a web
graph, but I would have no trouble building it on my own if I knew how
to access the pages' contents. Unfortunately, I have no idea how to do
this. Once pages are fetched, how does one view the HTML data?
Re: help for a nutch beginner
user name
2007-11-06 15:30:40
Josh Attenberg wrote:
> once pages
> are fetched, how does one view the HTML data?

nutch readseg

You might have to write your own version of SegmentReader (or modify
the existing one) to do exactly what you want, though.
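
For a first look at the raw content, something like the following might
do before touching any Java. It is only a sketch: it assumes the -dump
mode of readseg and the -no* switches listed in the SegmentReader usage
message, so check them against your Nutch version. The helper name
readseg_dump_command, the segment path and the output directory are
made up for illustration.

```python
import shutil
import subprocess


def readseg_dump_command(segment_dir, output_dir,
                         nutch_bin="bin/nutch", content_only=True):
    """Build the argument list for a `nutch readseg -dump` call.

    Assumption: readseg accepts `-dump <segment_dir> <output>` followed
    by optional -no* switches that suppress parts of the segment.
    """
    cmd = [nutch_bin, "readseg", "-dump", str(segment_dir), str(output_dir)]
    if content_only:
        # Suppress everything except the fetched content (the raw HTML).
        cmd += ["-nofetch", "-nogenerate", "-noparse",
                "-noparsedata", "-noparsetext"]
    return cmd


if __name__ == "__main__":
    # Hypothetical segment name; use one found under your crawl directory.
    cmd = readseg_dump_command("crawl/segments/20071106093000", "segdump")
    print(" ".join(cmd))
    # Run the command only if the nutch launcher is actually present.
    if shutil.which(cmd[0]):
        subprocess.run(cmd, check=True)
```

The dump is plain text, so it can be inspected with any pager. For
anything beyond browsing (a link graph over 100 million pages, say),
reading the segment data programmatically is probably still the way to
go.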

Cheers,
Carl.
