Hi,

I think this patch will make Nutch much easier to configure: the crawl
directory will be read from nutch-default.xml instead of being resolved
relative to the directory from which Nutch was started.

So nutch-default.xml will have its searcher.dir property set like this:
<property>
<name>searcher.dir</name>
<value>PATH_TO_CRAWL_DIR</value>
<description>
and this value will be used instead of the generated "crawl-" + getDate()
path.
Index: nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java
===================================================================
--- nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java	(revision 436809)
+++ nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java	(working copy)
@@ -53,10 +53,12 @@
     Configuration conf = NutchConfiguration.create();
     conf.addDefaultResource("crawl-tool.xml");
+    conf.addDefaultResource("nutch-default.xml");
     JobConf job = new NutchJob(conf);
     Path rootUrlDir = null;
-    Path dir = new Path("crawl-" + getDate());
+    String path2crawlDir = conf.get("searcher.dir");
+    Path dir = new Path(path2crawlDir);
     int threads = job.getInt("fetcher.threads.fetch", 10);
     int depth = 5;
     int topN = Integer.MAX_VALUE;
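One thing the change above does not cover is what happens when searcher.dir
is missing or empty. A rough sketch of a possible fallback follows. It uses
plain JDK classes (java.util.Properties, java.nio.file.Path) as stand-ins for
Nutch's Configuration and Hadoop's Path, and the class name, method name and
timestamp pattern are made up for illustration, not part of the patch:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;

// Hypothetical helper: Properties stands in for Nutch's Configuration,
// java.nio.file.Path stands in for Hadoop's Path.
class CrawlDirResolver {

  /**
   * Returns the directory named by searcher.dir, or a generated
   * "crawl-<timestamp>" directory when the property is unset or blank.
   */
  static Path resolveCrawlDir(Properties conf, Date now) {
    String configured = conf.getProperty("searcher.dir");
    if (configured == null || configured.trim().isEmpty()) {
      // Mirrors the idea of the old default, "crawl-" + getDate().
      // The timestamp pattern is an assumption for this sketch.
      String stamp = new SimpleDateFormat("yyyyMMddHHmmss").format(now);
      return Paths.get("crawl-" + stamp);
    }
    return Paths.get(configured.trim());
  }
}
```

With something like this, an unset property would keep the old behaviour
instead of passing null into the Path constructor.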
And this patch makes CrawlDbReader find that crawl directory, so the crawldb
path is derived from searcher.dir instead of being taken from args[0]:
Index: nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java
===================================================================
--- nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java	(revision 436809)
+++ nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java	(working copy)
@@ -406,8 +406,10 @@
       return;
     }
     String param = null;
-    String crawlDb = args[0];
+    //String crawlDb = args[0];
     Configuration conf = NutchConfiguration.create();
+    conf.addDefaultResource("nutch-default.xml");
+    String crawlDb = conf.get("searcher.dir") + "/crawldb";
     for (int i = 1; i < args.length; i++) {
       if (args[i].equals("-stats")) {
         dbr.processStatJob(crawlDb, conf);
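Note that plain string concatenation would yield "null/crawldb" if
searcher.dir is not set. A minimal sketch of a stricter lookup is below,
again with java.util.Properties standing in for Nutch's Configuration and
with a hypothetical class and method name:

```java
import java.util.Properties;

// Hypothetical helper: Properties stands in for Nutch's Configuration.
class CrawlDbLocator {

  /** Builds "<searcher.dir>/crawldb", failing fast when the property is missing. */
  static String crawlDbPath(Properties conf) {
    String dir = conf.getProperty("searcher.dir");
    if (dir == null || dir.trim().isEmpty()) {
      throw new IllegalStateException("searcher.dir is not set");
    }
    dir = dir.trim();
    // Drop trailing slashes so the result never contains "//crawldb".
    while (dir.length() > 1 && dir.endsWith("/")) {
      dir = dir.substring(0, dir.length() - 1);
    }
    return dir.equals("/") ? "/crawldb" : dir + "/crawldb";
  }
}
```

Failing early here seems preferable to running the stats job against a
directory that cannot exist, but that is a design choice open to discussion.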
What do you think?

Thanks,
David