
Thread: reading crawl dir from nutch-default.xml




user name
2006-08-25 14:26:29
Hi

I think this patch makes Nutch much easier to configure: the crawl
directory is read from nutch-default.xml instead of being resolved
relative to the directory the command was executed from.
nutch-default.xml then contains:
<property>
  <name>searcher.dir</name>
  <value>PATH_TO_CRAWL_DIR</value>
  <description>
The value of searcher.dir is then used as the crawl directory.

Index: nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java
===================================================================
--- nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java    (revision 436809)
+++ nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java    (working copy)
@@ -53,10 +53,12 @@

     Configuration conf = NutchConfiguration.create();
     conf.addDefaultResource("crawl-tool.xml");
+    conf.addDefaultResource("nutch-default.xml");
     JobConf job = new NutchJob(conf);

     Path rootUrlDir = null;
-    Path dir = new Path("crawl-" + getDate());
+    String path2crawlDir = conf.get("searcher.dir");
+    Path dir = new Path(path2crawlDir);
     int threads = job.getInt("fetcher.threads.fetch", 10);
     int depth = 5;
     int topN = Integer.MAX_VALUE;


The second patch makes CrawlDbReader look for the crawldb inside that
same crawl directory:

Index: nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java
===================================================================
--- nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java    (revision 436809)
+++ nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java    (working copy)
@@ -406,8 +406,10 @@
       return;
     }
     String param = null;
-    String crawlDb = args[0];
+    //String crawlDb = args[0];
     Configuration conf = NutchConfiguration.create();
+    conf.addDefaultResource("nutch-default.xml");
+    String crawlDb = conf.get("searcher.dir") + "/crawldb";
     for (int i = 1; i < args.length; i++) {
       if (args[i].equals("-stats")) {
         dbr.processStatJob(crawlDb, conf);
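
To illustrate the lookup both patches rely on, here is a minimal standalone sketch. It is not part of the patch: java.util.Properties stands in for the Nutch Configuration object so the example runs without Nutch on the classpath, and the class name CrawlDirResolver, the helper names, the fallback argument and the example path are all hypothetical. The sketch also falls back to a caller-supplied directory when searcher.dir is unset, which the patch above does not do.

```java
import java.util.Properties;

/**
 * Sketch only: Properties is a stand-in for the Nutch Configuration,
 * and every name in this class is hypothetical.
 */
class CrawlDirResolver {

    /** Returns the configured searcher.dir, or fallbackDir if it is unset or blank. */
    static String resolveCrawlDir(Properties conf, String fallbackDir) {
        String configured = conf.getProperty("searcher.dir");
        if (configured == null || configured.trim().isEmpty()) {
            return fallbackDir;
        }
        return configured.trim();
    }

    /** Appends "crawldb" to the crawl directory without doubling the slash. */
    static String crawlDbPath(String crawlDir) {
        return crawlDir.endsWith("/") ? crawlDir + "crawldb" : crawlDir + "/crawldb";
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Hypothetical value; in Nutch this would come from nutch-default.xml.
        conf.setProperty("searcher.dir", "/data/nutch/crawl");

        String dir = resolveCrawlDir(conf, "crawl-fallback");
        System.out.println(dir);
        System.out.println(crawlDbPath(dir));
    }
}
```

A fallback of this kind would keep the tools usable when the property is missing, since passing a null value to new Path(...) or concatenating it into the crawldb path would otherwise fail or produce a wrong path.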



What do you think?

thanks

David