
Thread: Re: Crawl not crawling entire page




Re: Crawl not crawling entire page
2007-03-22 11:18:57
Thus saith Mike Howarth:
> I've already played around with differing depths, generally
> from 3 to 10, and have had no distinguishable difference
> in results
>......
> Any more ideas?



I fought with a similar problem for quite a while.  I
suggest changing two things in your nutch-site.xml.

First, setting http.content.limit to -1 stops Nutch from
truncating the page.  By default Nutch keeps only the first
65536 bytes of each page, so links past that point are never
seen.  As long as your pages aren't so big that you're
going to kill the machine you're using, removing the
truncation should work.

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

Second, by default, Nutch only processes the first 100
outlinks it encounters on a page.  If you set
db.max.outlinks.per.page to -1, it will process all of the
links on the page.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a
  page.  If this value is nonnegative (>=0), at most
  db.max.outlinks.per.page outlinks will be processed for a page;
  otherwise, all outlinks will be processed.
  </description>
</property>
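
In case it helps to see where these go, here is a rough sketch
of a whole nutch-site.xml with both overrides.  I'm assuming
your file has nothing else in it; if you already have other
properties, keep them and just add these two inside the same
<configuration> element.  The comments are mine, not from the
Nutch docs.

<?xml version="1.0"?>
<configuration>

  <!-- Keep the whole page instead of cutting it off -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

  <!-- Process every outlink on a page, not just the first 100 -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>

</configuration>

Pages that were already fetched were stored truncated, so you
will most likely need to re-crawl before you see any difference.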


I hope this helps!

Ann




 