Thread: Problems fetching a high number of sites

user name
2006-05-24 17:23:21
Hello,

I'm currently having a problem fetching a high number of sites using
Nutch 0.7.1. The only configuration change I made in nutch-site.xml was
fetcher.threads.fetch = 40; the rest is default. The following is the
error output when attempting to fetch 10 million pages:

060524 094711 SEVERE error writing output:java.lang.OutOfMemoryError
java.lang.OutOfMemoryError
060524 094711 SEVERE error writing output:java.io.IOException: key out of order: 6430941 after 6430941
java.io.IOException: key out of order: 6430941 after 6430941
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:280)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
060524 094722 SEVERE error writing output:java.io.IOException: key out of order: 6430941 after 6430941
java.io.IOException: key out of order: 6430941 after 6430941
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:280)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)

As you can see, it stops fetching at about 6430941 sites. Before this,
the process got slower and slower, starting when it hit about 3 million.

From the error type, this looks like a memory problem. However, the
Nutch process never comes close to using up all my free system memory;
it uses only about 25%. My question is: would this be corrected by
allotting more memory to the Nutch fetcher process (the java command),
and if so, how would this be done? Or is there something that needs to
be corrected in the configuration files?
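
For reference, here is a minimal sketch of how the heap ceiling of a
java process can be checked. A JVM caps its heap independently of how
much system memory is free, so an OutOfMemoryError can occur while the
machine still has plenty of RAM available. The class name HeapReport is
hypothetical and not part of Nutch; it uses only the standard
java.lang.Runtime API.

```java
// Hypothetical helper (not part of Nutch): reports the JVM's heap limits.
public class HeapReport {
    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently in use (allocated minus free), in MiB. */
    public static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap  (MiB): " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

Running this once as "java HeapReport" and once as
"java -Xmx512m HeapReport" should show different ceilings; -Xmx is the
standard JVM option for the maximum heap size. Whether the Nutch start
script passes such an option through is something I have not verified.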

Thanks,

Sean Dean
