Thus saith Mike Howarth:
> I've already played around with differing depths
> generally from 3 to 10 and
> have had no distinguishable difference in results
>......
> Anymore ideas?
I fought with a similar problem for quite a while. I
suggest changing two things in your nutch-site.xml.
First, setting http.content.limit to -1 will prevent Nutch from
truncating the page. A truncated page loses any links past the
cut-off, so they never get crawled. As long as your pages aren't
so big that you're going to kill the machine you're using,
removing the truncation should work.
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
Second, by default, Nutch only processes the first 100 outlinks it
encounters on a page. If you set db.max.outlinks.per.page
to -1, it will follow all the links on each page.
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will
  be processed.
  </description>
</property>
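For reference, here is a minimal sketch of how the two overrides might sit together in nutch-site.xml. It assumes the usual layout, where all properties are wrapped in a single <configuration> root element; the descriptions are optional and left out here for brevity.

```xml
<?xml version="1.0"?>
<!-- Sketch only: assumes the standard <configuration> root element. -->
<configuration>
  <!-- Do not truncate downloaded content -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Process every outlink found on a page -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>
```

If your nutch-site.xml already has other properties, just add these two alongside them rather than replacing the file.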
I hope this helps!
Ann