Hi,
I have changed the protocol-http plugin so that Nutch reads
already-crawled pages from the local file system instead of from the
Internet. (I tried the file:// protocol, but it seemed to me that the
interconnection information among pages was lost.)

It works now, but it is very slow: the "fetch" command took 10 minutes
on 400 pages, and that was on a 4-CPU box with 4 threads. I am wondering
whether this is normal, because at that rate it would take roughly 417
hours per box to read 1 million pages, which is more than 17 days.
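
For reference, here is a minimal Python sketch of the arithmetic behind that estimate. It assumes fetch time scales linearly with the number of pages, which may not hold in practice, and the variable names are only illustrative.

```python
# Back-of-the-envelope extrapolation from the numbers in this post.
# Assumption: fetch time grows linearly with the number of pages.
pages_fetched = 400        # pages in the test run
minutes_taken = 10         # wall-clock time of the "fetch" command
target_pages = 1_000_000   # size of the full crawl

seconds_per_page = minutes_taken * 60 / pages_fetched
total_hours = target_pages * seconds_per_page / 3600
total_days = total_hours / 24

print(f"{seconds_per_page:.2f} s/page")
print(f"{total_hours:.0f} hours, {total_days:.1f} days")
```

That works out to 1.5 seconds per page across all 4 threads combined.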
Any suggestions would be appreciated.
Zhen