nutch nightly: IllegalArgumentException: Illegal Capacity: -1
2007-08-09 16:32:31
I can't seem to get the nightly build to work!  It looks
like an error that I was getting under cygwin is also
haunting me under BSD.  Am I doing something very wrong?  I
have tried this from scratch about two or three times now
and I still get this error in my hadoop.log:
             Illegal Capacity: -1


In these posts I was trying the nightly build with cygwin:

http://www.mail-archive.com/nutch-userlucene.apache.org/msg08955.html

http://www.mail-archive.com/nutch-userlucene.apache.org/msg08950.html

Now, I have installed a nutch nightly under BSD as follows:

$ cd /usr/tmp2
$ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk -r

$ mv trunk nutch-trunk
$ cd nutch-trunk
$ ant clean
$ ant -verbose
set NUTCH_HOME to /usr/tmp2/nutch_trunk
modify conf/nutch-site.xml
modify conf/crawl-urlfilter.txt
modify conf/log4j.properties

Just to be sure, I also ran this in /usr/tmp2/nutch-trunk:
$ svn up -r HEAD
$ ant clean
$ ant

I am unable to do an "intranet" style crawl. 
Here's what it looks like on the console:

$ bin/nutch crawl /usr/tmp2/urls.txt -dir /usr/tmp2/100sites -depth 4 -topN 5
crawl started in: /usr/tmp2/100sites
rootUrlDir = /usr/tmp2/urls.txt
threads = 10
depth = 4
topN = 5
Injector: starting
Injector: crawlDb: /usr/tmp2/100sites/crawldb
Injector: urlDir: /usr/tmp2/urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: /usr/tmp2/100sites/segments/20070809141119
Generator: filtering: false
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: /usr/tmp2/100sites/segments/20070809141119
Fetcher: threads: 10
fetching http://ec.europa.eu/grants/index_en.htm
fetching http://ec.europa.eu/information_society/media/index_en.htm
fetching http://filmfinancing.org/
fetching http://dedo.delaware.gov/filmoffice/default.shtml
fetching http://filmnanaimo.com/
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:499)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

Here's hadoop.log:

2007-08-09 13:55:57,428 INFO  crawl.Crawl - crawl started
in: /usr/tmp2/100sites
2007-08-09 13:55:57,430 INFO  crawl.Crawl - rootUrlDir =
/usr/tmp2/urls.txt
2007-08-09 13:55:57,431 INFO  crawl.Crawl - threads = 10
2007-08-09 13:55:57,431 INFO  crawl.Crawl - depth = 4
2007-08-09 13:55:57,431 INFO  crawl.Crawl - topN = 5
2007-08-09 13:55:57,542 INFO  crawl.Injector - Injector:
starting
2007-08-09 13:55:57,542 INFO  crawl.Injector - Injector:
crawlDb: /usr/tmp2/100sites/crawldb
2007-08-09 13:55:57,542 INFO  crawl.Injector - Injector:
urlDir: /usr/tmp2/urls.txt
2007-08-09 13:55:57,543 INFO  crawl.Injector - Injector:
Converting injected urls to crawl db entries.
2007-08-09 13:55:58,613 INFO  plugin.PluginRepository -
Plugins: looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -
Plugin Auto-activation mode: [true]
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -
Registered Plugins:
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -    
CyberNeko HTML Parser (lib-nekohtml)
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -    
Site Query Filter (query-site)
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -    
Basic URL Normalizer (urlnormalizer-basic)
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -    
Html Parse Plug-in (parse-html)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Regex URL Filter Framework (lib-regex-filter)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Feed Parse/Index/Query Plug-in (feed)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Basic Indexing Filter (index-basic)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Basic Summarizer Plug-in (summary-basic)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Text Parse Plug-in (parse-text)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
JavaScript Parser (parse-js)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Basic Query Filter (query-basic)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Regex URL Filter (urlfilter-regex)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
HTTP Framework (lib-http)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
XML Libraries (lib-xml)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
URL Query Filter (query-url)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Regex URL Normalizer (urlnormalizer-regex)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Http Protocol Plug-in (protocol-http)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
the nutch core extension points (nutch-extensionpoints)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
OPIC Scoring Plug-in (scoring-opic)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -
Registered Extension-Points:
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:55:58,979 WARN  regex.RegexURLNormalizer -
can't find rules for scope 'inject', using default
2007-08-09 13:56:00,516 INFO  crawl.Injector - Injector:
Merging injected urls into crawl db.
2007-08-09 13:56:01,723 WARN  util.NativeCodeLoader - Unable
to load native-hadoop library for your platform... using
builtin-java classes where applicable
2007-08-09 13:56:02,640 INFO  crawl.Injector - Injector:
done
2007-08-09 13:56:03,643 INFO  crawl.Generator - Generator:
Selecting best-scoring urls due for fetch.
2007-08-09 13:56:03,644 INFO  crawl.Generator - Generator:
starting
2007-08-09 13:56:03,644 INFO  crawl.Generator - Generator:
segment: /usr/tmp2/100sites/segments/20070809135603
2007-08-09 13:56:03,644 INFO  crawl.Generator - Generator:
filtering: false
2007-08-09 13:56:03,644 INFO  crawl.Generator - Generator:
topN: 5
2007-08-09 13:56:03,712 INFO  crawl.Generator - Generator:
jobtracker is 'local', generating exactly one partition.
2007-08-09 13:56:04,474 INFO  plugin.PluginRepository -
Plugins: looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -
Plugin Auto-activation mode: [true]
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -
Registered Plugins:
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
CyberNeko HTML Parser (lib-nekohtml)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Site Query Filter (query-site)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Basic URL Normalizer (urlnormalizer-basic)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Html Parse Plug-in (parse-html)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Regex URL Filter Framework (lib-regex-filter)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Feed Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Basic Indexing Filter (index-basic)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Basic Summarizer Plug-in (summary-basic)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Text Parse Plug-in (parse-text)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
JavaScript Parser (parse-js)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Basic Query Filter (query-basic)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Regex URL Filter (urlfilter-regex)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
HTTP Framework (lib-http)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
XML Libraries (lib-xml)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
URL Query Filter (query-url)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Regex URL Normalizer (urlnormalizer-regex)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Http Protocol Plug-in (protocol-http)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
the nutch core extension points (nutch-extensionpoints)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
OPIC Scoring Plug-in (scoring-opic)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -
Registered Extension-Points:
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:04,656 INFO  plugin.PluginRepository -    
Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:56:04,656 INFO  plugin.PluginRepository -    
HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:04,656 INFO  plugin.PluginRepository -    
Nutch Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:04,657 INFO  plugin.PluginRepository -    
Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:04,657 INFO  plugin.PluginRepository -    
Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:04,657 INFO  plugin.PluginRepository -    
Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:04,657 INFO  plugin.PluginRepository -    
Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:04,657 INFO  plugin.PluginRepository -    
Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:04,703 INFO  crawl.FetchScheduleFactory -
Using FetchSchedule impl:
org.apache.nutch.crawl.DefaultFetchSchedule
2007-08-09 13:56:04,704 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000.0
2007-08-09 13:56:04,705 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000.0
2007-08-09 13:56:04,712 WARN  regex.RegexURLNormalizer -
can't find rules for scope 'partition', using default
2007-08-09 13:56:05,063 INFO  plugin.PluginRepository -
Plugins: looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -
Plugin Auto-activation mode: [true]
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -
Registered Plugins:
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
CyberNeko HTML Parser (lib-nekohtml)
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
Site Query Filter (query-site)
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
Basic URL Normalizer (urlnormalizer-basic)
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
Html Parse Plug-in (parse-html)
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
Regex URL Filter Framework (lib-regex-filter)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Feed Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Basic Indexing Filter (index-basic)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Basic Summarizer Plug-in (summary-basic)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Text Parse Plug-in (parse-text)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
JavaScript Parser (parse-js)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Basic Query Filter (query-basic)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Regex URL Filter (urlfilter-regex)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
HTTP Framework (lib-http)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
XML Libraries (lib-xml)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
URL Query Filter (query-url)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Regex URL Normalizer (urlnormalizer-regex)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Http Protocol Plug-in (protocol-http)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
the nutch core extension points (nutch-extensionpoints)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
OPIC Scoring Plug-in (scoring-opic)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -
Registered Extension-Points:
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:05,261 INFO  crawl.FetchScheduleFactory -
Using FetchSchedule impl:
org.apache.nutch.crawl.DefaultFetchSchedule
2007-08-09 13:56:05,261 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000.0
2007-08-09 13:56:05,261 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000.0
2007-08-09 13:56:06,425 INFO  crawl.Generator - Generator:
Partitioning selected urls by host, for politeness.
2007-08-09 13:56:07,195 INFO  plugin.PluginRepository -
Plugins: looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:07,335 INFO  plugin.PluginRepository -
Plugin Auto-activation mode: [true]
2007-08-09 13:56:07,335 INFO  plugin.PluginRepository -
Registered Plugins:
2007-08-09 13:56:07,335 INFO  plugin.PluginRepository -    
CyberNeko HTML Parser (lib-nekohtml)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Site Query Filter (query-site)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Basic URL Normalizer (urlnormalizer-basic)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Html Parse Plug-in (parse-html)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Regex URL Filter Framework (lib-regex-filter)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Feed Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Basic Indexing Filter (index-basic)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Basic Summarizer Plug-in (summary-basic)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Text Parse Plug-in (parse-text)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
JavaScript Parser (parse-js)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Basic Query Filter (query-basic)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Regex URL Filter (urlfilter-regex)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
HTTP Framework (lib-http)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
XML Libraries (lib-xml)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
URL Query Filter (query-url)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Regex URL Normalizer (urlnormalizer-regex)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Http Protocol Plug-in (protocol-http)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
the nutch core extension points (nutch-extensionpoints)
2007-08-09 13:56:07,337 INFO  plugin.PluginRepository -    
OPIC Scoring Plug-in (scoring-opic)
2007-08-09 13:56:07,337 INFO  plugin.PluginRepository -
Registered Extension-Points:
2007-08-09 13:56:07,337 INFO  plugin.PluginRepository -    
Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:07,337 INFO  plugin.PluginRepository -    
Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:07,337 INFO  plugin.PluginRepository -    
Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Nutch Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:07,339 INFO  plugin.PluginRepository -    
Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:07,339 INFO  plugin.PluginRepository -    
Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:07,358 WARN  regex.RegexURLNormalizer -
can't find rules for scope 'partition', using default
2007-08-09 13:56:08,184 INFO  crawl.Generator - Generator:
done.
2007-08-09 13:56:08,184 INFO  fetcher.Fetcher - Fetcher:
starting
2007-08-09 13:56:08,185 INFO  fetcher.Fetcher - Fetcher:
segment: /usr/tmp2/100sites/segments/20070809135603
2007-08-09 13:56:08,988 INFO  fetcher.Fetcher - Fetcher:
threads: 10
2007-08-09 13:56:08,992 INFO  plugin.PluginRepository -
Plugins: looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:09,154 INFO  plugin.PluginRepository -
Plugin Auto-activation mode: [true]
2007-08-09 13:56:09,154 INFO  plugin.PluginRepository -
Registered Plugins:
2007-08-09 13:56:09,154 INFO  plugin.PluginRepository -    
CyberNeko HTML Parser (lib-nekohtml)
2007-08-09 13:56:09,154 INFO  plugin.PluginRepository -    
Site Query Filter (query-site)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Basic URL Normalizer (urlnormalizer-basic)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Html Parse Plug-in (parse-html)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Regex URL Filter Framework (lib-regex-filter)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Feed Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Basic Indexing Filter (index-basic)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Basic Summarizer Plug-in (summary-basic)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Text Parse Plug-in (parse-text)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
JavaScript Parser (parse-js)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Basic Query Filter (query-basic)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Regex URL Filter (urlfilter-regex)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
HTTP Framework (lib-http)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
XML Libraries (lib-xml)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
URL Query Filter (query-url)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Regex URL Normalizer (urlnormalizer-regex)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Http Protocol Plug-in (protocol-http)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
the nutch core extension points (nutch-extensionpoints)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
OPIC Scoring Plug-in (scoring-opic)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -
Registered Extension-Points:
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:09,211 INFO  fetcher.Fetcher - fetching http://ec.europa.eu/grants/index_en.htm
2007-08-09 13:56:09,213 INFO  fetcher.Fetcher - fetching http://ec.europa.eu/information_society/media/index_en.htm
2007-08-09 13:56:09,213 INFO  fetcher.Fetcher - fetching http://filmfinancing.org/
2007-08-09 13:56:09,240 FATAL api.RobotRulesParser - Agent
we advertise (currentNutch) not listed first in
'http.robots.agents' property!
2007-08-09 13:56:09,240 INFO  http.Http - http.proxy.host =
null
2007-08-09 13:56:09,241 INFO  http.Http - http.proxy.port =
8080
2007-08-09 13:56:09,241 INFO  http.Http - http.timeout =
10000
2007-08-09 13:56:09,244 INFO  http.Http - http.content.limit
= 65536
2007-08-09 13:56:09,244 INFO  http.Http - http.agent =
currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk; http://hopoo.dyndns.org;
kai(underscore)testing(att)yahoo(dotcom))
2007-08-09 13:56:09,244 INFO  http.Http -
protocol.plugin.check.blocking = true
2007-08-09 13:56:09,244 INFO  http.Http -
protocol.plugin.check.robots = true
2007-08-09 13:56:09,244 INFO  http.Http -
fetcher.server.delay = 3000
2007-08-09 13:56:09,245 INFO  http.Http - http.max.delays =
100
2007-08-09 13:56:09,212 INFO  fetcher.Fetcher - fetching http://filmnanaimo.com/
2007-08-09 13:56:09,213 INFO  fetcher.Fetcher - fetching http://dedo.delaware.gov/filmoffice/default.shtml
2007-08-09 13:56:09,249 FATAL api.RobotRulesParser - Agent
we advertise (currentNutch) not listed first in
'http.robots.agents' property!
2007-08-09 13:56:09,249 INFO  http.Http - http.proxy.host =
null
2007-08-09 13:56:09,249 INFO  http.Http - http.proxy.port =
8080
2007-08-09 13:56:09,249 INFO  http.Http - http.timeout =
10000
2007-08-09 13:56:09,249 INFO  http.Http - http.content.limit
= 65536
2007-08-09 13:56:09,249 INFO  http.Http - http.agent =
currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk; http://hopoo.dyndns.org;
kai(underscore)testing(att)yahoo(dotcom))
2007-08-09 13:56:09,249 INFO  http.Http -
protocol.plugin.check.blocking = true
2007-08-09 13:56:09,249 INFO  http.Http -
protocol.plugin.check.robots = true
2007-08-09 13:56:09,249 INFO  http.Http -
fetcher.server.delay = 3000
2007-08-09 13:56:09,258 INFO  http.Http - http.max.delays =
100
2007-08-09 13:56:09,259 FATAL api.RobotRulesParser - Agent
we advertise (currentNutch) not listed first in
'http.robots.agents' property!
2007-08-09 13:56:09,259 INFO  http.Http - http.proxy.host =
null
2007-08-09 13:56:09,259 INFO  http.Http - http.proxy.port =
8080
2007-08-09 13:56:09,259 INFO  http.Http - http.timeout =
10000
2007-08-09 13:56:09,259 INFO  http.Http - http.content.limit
= 65536
2007-08-09 13:56:09,259 INFO  http.Http - http.agent =
currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk; http://hopoo.dyndns.org;
kai(underscore)testing(att)yahoo(dotcom))
2007-08-09 13:56:09,259 INFO  http.Http -
protocol.plugin.check.blocking = true
2007-08-09 13:56:09,259 INFO  http.Http -
protocol.plugin.check.robots = true
2007-08-09 13:56:09,259 INFO  http.Http -
fetcher.server.delay = 3000
2007-08-09 13:56:09,259 INFO  http.Http - http.max.delays =
100
2007-08-09 13:56:10,094 WARN  regex.RegexURLNormalizer -
can't find rules for scope 'outlink', using default
2007-08-09 13:56:10,123 INFO  crawl.SignatureFactory - Using
Signature impl: org.apache.nutch.crawl.MD5Signature
2007-08-09 13:56:17,661 INFO  plugin.PluginRepository -
Plugins: looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:17,797 INFO  plugin.PluginRepository -
Plugin Auto-activation mode: [true]
2007-08-09 13:56:17,797 INFO  plugin.PluginRepository -
Registered Plugins:
2007-08-09 13:56:17,797 INFO  plugin.PluginRepository -    
CyberNeko HTML Parser (lib-nekohtml)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Site Query Filter (query-site)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Basic URL Normalizer (urlnormalizer-basic)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Html Parse Plug-in (parse-html)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Regex URL Filter Framework (lib-regex-filter)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Feed Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Basic Indexing Filter (index-basic)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Basic Summarizer Plug-in (summary-basic)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Text Parse Plug-in (parse-text)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
JavaScript Parser (parse-js)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Basic Query Filter (query-basic)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Regex URL Filter (urlfilter-regex)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
HTTP Framework (lib-http)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
XML Libraries (lib-xml)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
URL Query Filter (query-url)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Regex URL Normalizer (urlnormalizer-regex)
2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -    
Http Protocol Plug-in (protocol-http)
2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -    
the nutch core extension points (nutch-extensionpoints)
2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -    
OPIC Scoring Plug-in (scoring-opic)
2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -
Registered Extension-Points:
2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -    
Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
Nutch Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -    
Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:18,911 WARN  mapred.LocalJobRunner - job_blv6jf
java.lang.IllegalArgumentException: Illegal Capacity: -1
    at java.util.ArrayList.<init>(ArrayList.java:111)
    at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:149)
    at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:94)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:311)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)





     
RE: nutch nightly: IllegalArgumentException: Illegal Capacity: -1
United States
2007-09-05 16:46:11
Has there been any success with this?  I am running into the exact same
problem, but on a Fedora 6 machine.

<switch to after having a quick look at the source>

This is the line that tries to create an arraylist with an initial
capacity of -1.  How inappropriate.
List<Entry<Text, CrawlDatum>> targets =
    new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);

Oh, outlinksToStore, that is likely related to the
db.max.outlinks.per.page configuration variable.  Funny enough, that
was set to -1 since I want all outlinks to be processed and, as the
description says, if the value is >=0 at most db.max.outlinks.per.page
outlinks will be processed for a page; otherwise, all outlinks will be
processed.
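
For anyone reading along, here is a minimal sketch of the failure described above. It relies only on standard java.util.ArrayList behaviour: the constructor rejects a negative initial capacity. The class name CapacityRepro and the method tryCapacity are made up for illustration and are not taken from the Nutch source.

```java
import java.util.ArrayList;
import java.util.List;

public class CapacityRepro {
    // Tries to build a list with the given initial capacity.
    // Returns the exception message if the constructor rejects it,
    // or null if construction succeeded.
    static String tryCapacity(int outlinksToStore) {
        try {
            List<String> targets = new ArrayList<String>(outlinksToStore);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // -1 stands in for a "no limit" setting passed straight through as a capacity.
        System.out.println(tryCapacity(-1));
        // A non-negative capacity, including 0, is accepted.
        System.out.println(tryCapacity(0));
    }
}
```

The first call reports the same "Illegal Capacity: -1" message that shows up in hadoop.log.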

My crawl appears to be running just fine now, after I set a very large
value into db.max.outlinks.per.page.  Someone should look into fixing
that.

Jeff
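
One possible shape for a fix, sketched under assumptions: treat a negative db.max.outlinks.per.page as "no limit" and never size the list beyond the number of outlinks actually present. OutlinkLimit, effectiveLimit and selectOutlinks are hypothetical names, and plain strings stand in for the real outlink type. This is not the actual Nutch patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutlinkLimit {
    // A negative maxOutlinksPerPage means "keep all outlinks", as the
    // property description says; otherwise keep at most that many.
    static int effectiveLimit(int maxOutlinksPerPage, int outlinkCount) {
        if (maxOutlinksPerPage < 0) {
            return outlinkCount;
        }
        return Math.min(maxOutlinksPerPage, outlinkCount);
    }

    // Copies the first 'limit' outlinks into a list whose initial
    // capacity is always non-negative.
    static List<String> selectOutlinks(List<String> outlinks, int maxOutlinksPerPage) {
        int limit = effectiveLimit(maxOutlinksPerPage, outlinks.size());
        List<String> targets = new ArrayList<String>(limit);
        for (int i = 0; i < limit; i++) {
            targets.add(outlinks.get(i));
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList("a", "b", "c");
        System.out.println(selectOutlinks(outlinks, -1).size()); // no limit
        System.out.println(selectOutlinks(outlinks, 2).size());  // capped
    }
}
```

With a guard like this, -1 could stay in the configuration and the very-large-value workaround would not be needed.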


-----Original Message-----
From: Kai_testing Middleton [mailto:kai_testingyahoo.com] 
Sent: Thursday, August 09, 2007 5:33 PM
To: nutch user
Subject: nutch nightly: IllegalArgumentException: Illegal
Capacity: -1

I can't seem to get the nightly build to work!  It looks
like an error
that I was getting under cygwin is also haunting me under
BSD.  Am I
doing something very wrong?  I have tried this from scratch
about two
or three times now and I still get this error in my
hadoop.log:
             Illegal Capacity: -1


In these posts I was trying the nightly build with cygwin:

http://www.mai
l-archive.com/nutch-userlucene.apache.org/msg08955.html

http://www.mai
l-archive.com/nutch-userlucene.apache.org/msg08950.html

Now, I have installed a nutch nightly under BSD as follows:

$ cd /usr/tmp2
$ svn co ht
tp://svn.apache.org/repos/asf/lucene/nutch/trunk -r

$ mv trunk nutch-trunk
$ cd nutch-trunk
$ ant clean
$ ant -verbose
set NUTCH_HOME to /usr/tmp2/nutch_trunk
modify conf/nutch-site.xml
modify conf/crawl-urlfilter.txt
modify conf/log4j.properties

Just to be sure, I also ran this in /usr/tmp2/nutch-trunk:
$ svn up -r HEAD
$ ant clean
$ ant

I am unable to do an "intranet" style crawl. 
Here's what it looks like
on the console:

$ bin/nutch crawl /usr/tmp2/urls.txt -dir /usr/tmp2/100sites
-depth 4
-topN 5
crawl started in: /usr/tmp2/100sites
rootUrlDir = /usr/tmp2/urls.txt
threads = 10
depth = 4
topN = 5
Injector: starting
Injector: crawlDb: /usr/tmp2/100sites/crawldb
Injector: urlDir: /usr/tmp2/urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment:
/usr/tmp2/100sites/segments/20070809141119
Generator: filtering: false
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one
partition.
Generator: Partitioning selected urls by host, for
politeness.
Generator: done.
Fetcher: starting
Fetcher: segment:
/usr/tmp2/100sites/segments/20070809141119
Fetcher: threads: 10
fetching http://ec.eur
opa.eu/grants/index_en.htm
fetching http://ec.europa.eu/information_society/media/index_en
.htm
fetching http://filmfinancing.org/
fetching htt
p://dedo.delaware.gov/filmoffice/default.shtml
fetching http://filmnanaimo.com/
Exception in thread "main" java.io.IOException:
Job failed!
        at
org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604
)
        at
org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:499)
        at
org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

Here's hadoop.log:

2007-08-09 13:55:57,428 INFO  crawl.Crawl - crawl started
in:
/usr/tmp2/100sites
2007-08-09 13:55:57,430 INFO  crawl.Crawl - rootUrlDir =
/usr/tmp2/urls.txt
2007-08-09 13:55:57,431 INFO  crawl.Crawl - threads = 10
2007-08-09 13:55:57,431 INFO  crawl.Crawl - depth = 4
2007-08-09 13:55:57,431 INFO  crawl.Crawl - topN = 5
2007-08-09 13:55:57,542 INFO  crawl.Injector - Injector:
starting
2007-08-09 13:55:57,542 INFO  crawl.Injector - Injector:
crawlDb:
/usr/tmp2/100sites/crawldb
2007-08-09 13:55:57,542 INFO  crawl.Injector - Injector:
urlDir:
/usr/tmp2/urls.txt
2007-08-09 13:55:57,543 INFO  crawl.Injector - Injector:
Converting
injected urls to crawl db entries.
2007-08-09 13:55:58,613 INFO  plugin.PluginRepository -
Plugins:
looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -
Plugin
Auto-activation mode: [true]
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -
Registered
Plugins:
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -    
CyberNeko
HTML Parser (lib-nekohtml)
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -    
Site Query
Filter (query-site)
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -    
Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:55:58,842 INFO  plugin.PluginRepository -    
Html Parse
Plug-in (parse-html)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Regex URL
Filter Framework (lib-regex-filter)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Basic
Indexing Filter (index-basic)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Basic
Summarizer Plug-in (summary-basic)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Text Parse
Plug-in (parse-text)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
JavaScript
Parser (parse-js)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Basic Query
Filter (query-basic)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Regex URL
Filter (urlfilter-regex)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
HTTP
Framework (lib-http)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
XML
Libraries (lib-xml)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
URL Query
Filter (query-url)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
Http
Protocol Plug-in (protocol-http)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
the nutch
core extension points (nutch-extensionpoints)
2007-08-09 13:55:58,850 INFO  plugin.PluginRepository -    
OPIC
Scoring Plug-in (scoring-opic)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -
Registered
Extension-Points:
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch
Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:55:58,851 INFO  plugin.PluginRepository -    
Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:55:58,979 WARN  regex.RegexURLNormalizer -
can't find
rules for scope 'inject', using default
2007-08-09 13:56:00,516 INFO  crawl.Injector - Injector:
Merging
injected urls into crawl db.
2007-08-09 13:56:01,723 WARN  util.NativeCodeLoader - Unable
to load
native-hadoop library for your platform... using
builtin-java classes
where applicable
2007-08-09 13:56:02,640 INFO  crawl.Injector - Injector:
done
2007-08-09 13:56:03,643 INFO  crawl.Generator - Generator:
Selecting
best-scoring urls due for fetch.
2007-08-09 13:56:03,644 INFO  crawl.Generator - Generator:
starting
2007-08-09 13:56:03,644 INFO  crawl.Generator - Generator:
segment:
/usr/tmp2/100sites/segments/20070809135603
2007-08-09 13:56:03,644 INFO  crawl.Generator - Generator:
filtering:
false
2007-08-09 13:56:03,644 INFO  crawl.Generator - Generator:
topN: 5
2007-08-09 13:56:03,712 INFO  crawl.Generator - Generator:
jobtracker
is 'local', generating exactly one partition.
2007-08-09 13:56:04,474 INFO  plugin.PluginRepository -
Plugins:
looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -
Plugin
Auto-activation mode: [true]
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -
Registered
Plugins:
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
CyberNeko
HTML Parser (lib-nekohtml)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Site Query
Filter (query-site)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Html Parse
Plug-in (parse-html)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Regex URL
Filter Framework (lib-regex-filter)
2007-08-09 13:56:04,654 INFO  plugin.PluginRepository -    
Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Basic
Indexing Filter (index-basic)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Basic
Summarizer Plug-in (summary-basic)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Text Parse
Plug-in (parse-text)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
JavaScript
Parser (parse-js)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Basic Query
Filter (query-basic)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Regex URL
Filter (urlfilter-regex)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
HTTP
Framework (lib-http)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
XML
Libraries (lib-xml)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
URL Query
Filter (query-url)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Http
Protocol Plug-in (protocol-http)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
the nutch
core extension points (nutch-extensionpoints)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
OPIC
Scoring Plug-in (scoring-opic)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -
Registered
Extension-Points:
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:04,655 INFO  plugin.PluginRepository -    
Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:04,656 INFO  plugin.PluginRepository -    
Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:56:04,656 INFO  plugin.PluginRepository -    
HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:04,656 INFO  plugin.PluginRepository -    
Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:04,657 INFO  plugin.PluginRepository -    
Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:04,657 INFO  plugin.PluginRepository -    
Nutch
Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:04,657 INFO  plugin.PluginRepository -    
Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:04,657 INFO  plugin.PluginRepository -    
Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:04,657 INFO  plugin.PluginRepository -    
Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:04,703 INFO  crawl.FetchScheduleFactory -
Using
FetchSchedule impl:
org.apache.nutch.crawl.DefaultFetchSchedule
2007-08-09 13:56:04,704 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000.0
2007-08-09 13:56:04,705 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000.0
2007-08-09 13:56:04,712 WARN  regex.RegexURLNormalizer -
can't find
rules for scope 'partition', using default
2007-08-09 13:56:05,063 INFO  plugin.PluginRepository -
Plugins:
looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -
Plugin
Auto-activation mode: [true]
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -
Registered
Plugins:
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
CyberNeko
HTML Parser (lib-nekohtml)
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
Site Query
Filter (query-site)
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
Html Parse
Plug-in (parse-html)
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:56:05,207 INFO  plugin.PluginRepository -    
Regex URL
Filter Framework (lib-regex-filter)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Basic
Indexing Filter (index-basic)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Basic
Summarizer Plug-in (summary-basic)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Text Parse
Plug-in (parse-text)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
JavaScript
Parser (parse-js)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Basic Query
Filter (query-basic)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Regex URL
Filter (urlfilter-regex)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
HTTP
Framework (lib-http)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
XML
Libraries (lib-xml)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
URL Query
Filter (query-url)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Http
Protocol Plug-in (protocol-http)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
the nutch
core extension points (nutch-extensionpoints)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
OPIC
Scoring Plug-in (scoring-opic)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -
Registered
Extension-Points:
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:05,208 INFO  plugin.PluginRepository -    
Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch
Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:05,209 INFO  plugin.PluginRepository -    
Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:05,261 INFO  crawl.FetchScheduleFactory -
Using
FetchSchedule impl:
org.apache.nutch.crawl.DefaultFetchSchedule
2007-08-09 13:56:05,261 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000.0
2007-08-09 13:56:05,261 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000.0
2007-08-09 13:56:06,425 INFO  crawl.Generator - Generator:
Partitioning
selected urls by host, for politeness.
2007-08-09 13:56:07,195 INFO  plugin.PluginRepository -
Plugins:
looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:07,335 INFO  plugin.PluginRepository -
Plugin
Auto-activation mode: [true]
2007-08-09 13:56:07,335 INFO  plugin.PluginRepository -
Registered
Plugins:
2007-08-09 13:56:07,335 INFO  plugin.PluginRepository -    
CyberNeko
HTML Parser (lib-nekohtml)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Site Query
Filter (query-site)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Html Parse
Plug-in (parse-html)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Regex URL
Filter Framework (lib-regex-filter)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Basic
Indexing Filter (index-basic)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Basic
Summarizer Plug-in (summary-basic)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Text Parse
Plug-in (parse-text)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
JavaScript
Parser (parse-js)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Basic Query
Filter (query-basic)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Regex URL
Filter (urlfilter-regex)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
HTTP
Framework (lib-http)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
XML
Libraries (lib-xml)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
URL Query
Filter (query-url)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
Http
Protocol Plug-in (protocol-http)
2007-08-09 13:56:07,336 INFO  plugin.PluginRepository -    
the nutch
core extension points (nutch-extensionpoints)
2007-08-09 13:56:07,337 INFO  plugin.PluginRepository -    
OPIC
Scoring Plug-in (scoring-opic)
2007-08-09 13:56:07,337 INFO  plugin.PluginRepository -
Registered
Extension-Points:
2007-08-09 13:56:07,337 INFO  plugin.PluginRepository -    
Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:07,337 INFO  plugin.PluginRepository -    
Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:07,337 INFO  plugin.PluginRepository -    
Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Nutch
Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:07,338 INFO  plugin.PluginRepository -    
Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:07,339 INFO  plugin.PluginRepository -    
Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:07,339 INFO  plugin.PluginRepository -    
Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:07,358 WARN  regex.RegexURLNormalizer -
can't find
rules for scope 'partition', using default
2007-08-09 13:56:08,184 INFO  crawl.Generator - Generator:
done.
2007-08-09 13:56:08,184 INFO  fetcher.Fetcher - Fetcher:
starting
2007-08-09 13:56:08,185 INFO  fetcher.Fetcher - Fetcher:
segment:
/usr/tmp2/100sites/segments/20070809135603
2007-08-09 13:56:08,988 INFO  fetcher.Fetcher - Fetcher:
threads: 10
2007-08-09 13:56:08,992 INFO  plugin.PluginRepository -
Plugins:
looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:09,154 INFO  plugin.PluginRepository -
Plugin
Auto-activation mode: [true]
2007-08-09 13:56:09,154 INFO  plugin.PluginRepository -
Registered
Plugins:
2007-08-09 13:56:09,154 INFO  plugin.PluginRepository -    
CyberNeko
HTML Parser (lib-nekohtml)
2007-08-09 13:56:09,154 INFO  plugin.PluginRepository -    
Site Query
Filter (query-site)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Html Parse
Plug-in (parse-html)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Regex URL
Filter Framework (lib-regex-filter)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Basic
Indexing Filter (index-basic)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Basic
Summarizer Plug-in (summary-basic)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Text Parse
Plug-in (parse-text)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
JavaScript
Parser (parse-js)
2007-08-09 13:56:09,155 INFO  plugin.PluginRepository -    
Basic Query
Filter (query-basic)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Regex URL
Filter (urlfilter-regex)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
HTTP
Framework (lib-http)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
XML
Libraries (lib-xml)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
URL Query
Filter (query-url)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Regex URL
Normalizer (urlnormalizer-regex)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Http
Protocol Plug-in (protocol-http)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
the nutch
core extension points (nutch-extensionpoints)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
OPIC
Scoring Plug-in (scoring-opic)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -
Registered
Extension-Points:
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:09,156 INFO  plugin.PluginRepository -    
Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch
Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:09,157 INFO  plugin.PluginRepository -    
Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:09,211 INFO  fetcher.Fetcher - fetching
http://ec.eur
opa.eu/grants/index_en.htm
2007-08-09 13:56:09,213 INFO  fetcher.Fetcher - fetching
http://ec.europa.eu/information_society/media/index_en
.htm
2007-08-09 13:56:09,213 INFO  fetcher.Fetcher - fetching
http://filmfinancing.org/
2007-08-09 13:56:09,240 FATAL api.RobotRulesParser - Agent
we advertise
(currentNutch) not listed first in 'http.robots.agents'
property!
2007-08-09 13:56:09,240 INFO  http.Http - http.proxy.host =
null
2007-08-09 13:56:09,241 INFO  http.Http - http.proxy.port =
8080
2007-08-09 13:56:09,241 INFO  http.Http - http.timeout =
10000
2007-08-09 13:56:09,244 INFO  http.Http - http.content.limit
= 65536
2007-08-09 13:56:09,244 INFO  http.Http - http.agent =
currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk;
http://hopoo.dyndns.org;
kai(underscore)testing(att)yahoo(dotcom))
2007-08-09 13:56:09,244 INFO  http.Http -
protocol.plugin.check.blocking = true
2007-08-09 13:56:09,244 INFO  http.Http -
protocol.plugin.check.robots
= true
2007-08-09 13:56:09,244 INFO  http.Http -
fetcher.server.delay = 3000
2007-08-09 13:56:09,245 INFO  http.Http - http.max.delays =
100
2007-08-09 13:56:09,212 INFO  fetcher.Fetcher - fetching
http://filmnanaimo.com/
2007-08-09 13:56:09,213 INFO  fetcher.Fetcher - fetching
htt
p://dedo.delaware.gov/filmoffice/default.shtml
2007-08-09 13:56:09,249 FATAL api.RobotRulesParser - Agent
we advertise
(currentNutch) not listed first in 'http.robots.agents'
property!
2007-08-09 13:56:09,249 INFO  http.Http - http.proxy.host =
null
2007-08-09 13:56:09,249 INFO  http.Http - http.proxy.port =
8080
2007-08-09 13:56:09,249 INFO  http.Http - http.timeout =
10000
2007-08-09 13:56:09,249 INFO  http.Http - http.content.limit
= 65536
2007-08-09 13:56:09,249 INFO  http.Http - http.agent =
currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk;
http://hopoo.dyndns.org;
kai(underscore)testing(att)yahoo(dotcom))
2007-08-09 13:56:09,249 INFO  http.Http -
protocol.plugin.check.blocking = true
2007-08-09 13:56:09,249 INFO  http.Http -
protocol.plugin.check.robots
= true
2007-08-09 13:56:09,249 INFO  http.Http -
fetcher.server.delay = 3000
2007-08-09 13:56:09,258 INFO  http.Http - http.max.delays =
100
2007-08-09 13:56:09,259 FATAL api.RobotRulesParser - Agent
we advertise
(currentNutch) not listed first in 'http.robots.agents'
property!
2007-08-09 13:56:09,259 INFO  http.Http - http.proxy.host =
null
2007-08-09 13:56:09,259 INFO  http.Http - http.proxy.port =
8080
2007-08-09 13:56:09,259 INFO  http.Http - http.timeout =
10000
2007-08-09 13:56:09,259 INFO  http.Http - http.content.limit
= 65536
2007-08-09 13:56:09,259 INFO  http.Http - http.agent =
currentNutch/Nutch-1.0-dev (crawler v0.9 from trunk;
http://hopoo.dyndns.org;
kai(underscore)testing(att)yahoo(dotcom))
2007-08-09 13:56:09,259 INFO  http.Http -
protocol.plugin.check.blocking = true
2007-08-09 13:56:09,259 INFO  http.Http -
protocol.plugin.check.robots
= true
2007-08-09 13:56:09,259 INFO  http.Http -
fetcher.server.delay = 3000
2007-08-09 13:56:09,259 INFO  http.Http - http.max.delays =
100
2007-08-09 13:56:10,094 WARN  regex.RegexURLNormalizer -
can't find
rules for scope 'outlink', using default
2007-08-09 13:56:10,123 INFO  crawl.SignatureFactory - Using
Signature
impl: org.apache.nutch.crawl.MD5Signature
2007-08-09 13:56:17,661 INFO  plugin.PluginRepository -
Plugins:
looking in: /usr/tmp2/nutch-trunk/build/plugins
2007-08-09 13:56:17,797 INFO  plugin.PluginRepository -
Plugin
Auto-activation mode: [true]
2007-08-09 13:56:17,797 INFO  plugin.PluginRepository -
Registered
Plugins:
2007-08-09 13:56:17,797 INFO  plugin.PluginRepository -    
CyberNeko
HTML Parser (lib-nekohtml)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Site Query
Filter (query-site)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Basic URL
Normalizer (urlnormalizer-basic)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Html Parse
Plug-in (parse-html)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Regex URL
Filter Framework (lib-regex-filter)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Feed
Parse/Index/Query Plug-in (feed)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Basic
Indexing Filter (index-basic)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Basic
Summarizer Plug-in (summary-basic)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Text Parse
Plug-in (parse-text)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
JavaScript
Parser (parse-js)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Basic Query
Filter (query-basic)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
Regex URL
Filter (urlfilter-regex)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -    
HTTP
Framework (lib-http)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     XML Libraries (lib-xml)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     URL Query Filter (query-url)
2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Regex URL Normalizer (urlnormalizer-regex)
2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -     Http Protocol Plug-in (protocol-http)
2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -     the nutch core extension points (nutch-extensionpoints)
2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -     OPIC Scoring Plug-in (scoring-opic)
2007-08-09 13:56:17,799 INFO  plugin.PluginRepository - Registered Extension-Points:
2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -     Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-08-09 13:56:18,911 WARN  mapred.LocalJobRunner - job_blv6jf
java.lang.IllegalArgumentException: Illegal Capacity: -1
    at java.util.ArrayList.<init>(ArrayList.java:111)
    at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:149)
    at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:94)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:311)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
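
For readers following the trace: the top frame is the java.util.ArrayList constructor, which rejects a negative initial capacity. The short sketch below reproduces that behavior in isolation. The class and method names (IllegalCapacityDemo, tryCapacity) are made up for illustration and are not part of Nutch or Hadoop.

```java
import java.util.ArrayList;

// Hypothetical stand-alone demo; not Nutch code.
final class IllegalCapacityDemo {

    // Returns the exception message raised by new ArrayList<>(capacity),
    // or null when the constructor accepts the capacity.
    static String tryCapacity(int capacity) {
        try {
            new ArrayList<String>(capacity);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCapacity(-1)); // Illegal Capacity: -1
        System.out.println(tryCapacity(0));  // null
    }
}
```

The message for -1 is the same text that appears in hadoop.log above, which suggests some caller in ParseOutputFormat passed -1 as the list size.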





 

Re: nutch nightly: IllegalArgumentException: Illegal Capacity: -1
user name
2007-09-07 02:34:42
On 9/6/07, Bolle, Jeffrey F. <jbollemitre.org> wrote:
> Has there been any success with this?  I am running into the exact same
> problem, but on a Fedora 6 machine.
>
> <switch to after having a quick look at the source>
>
> This is the line that tries to create an arraylist with an initial
> capacity of -1.  How inappropriate.
> List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
>
> Oh, outlinksToStore, that is likely related to the
> db.max.outlinks.per.page configuration variable.  Funny enough, that
> was set to -1 since I want all outlinks to be processed and, as the
> description says, if the value is >=0 at most db.max.outlinks.per.page
> outlinks will be processed for a page; otherwise, all outlinks will be
> processed.
>
> My crawl appears to be running just fine now, after I set a very large
> value into db.max.outlinks.per.page.  Someone should look into fixing
> that.
>

My bad. I am going to enter a JIRA and commit a fix for this soon.
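
One way the guard could look is sketched below, assuming the semantics quoted above (a negative db.max.outlinks.per.page means "no limit"). This is not the fix that was committed to Nutch; OutlinkLimit, outlinksToStore and selectOutlinks are hypothetical names, and plain strings stand in for the Entry<Text, CrawlDatum> elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the actual ParseOutputFormat code.
final class OutlinkLimit {

    // Maps the configured limit to a usable list size:
    // a negative limit means "keep every outlink".
    static int outlinksToStore(int maxOutlinksPerPage, int outlinkCount) {
        if (maxOutlinksPerPage < 0) {
            return outlinkCount;
        }
        return Math.min(maxOutlinksPerPage, outlinkCount);
    }

    // Copies at most the allowed number of outlinks into a new list
    // whose initial capacity is never negative.
    static List<String> selectOutlinks(List<String> outlinks, int maxOutlinksPerPage) {
        int n = outlinksToStore(maxOutlinksPerPage, outlinks.size());
        List<String> targets = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            targets.add(outlinks.get(i));
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList("a", "b", "c");
        System.out.println(selectOutlinks(outlinks, -1)); // [a, b, c]
        System.out.println(selectOutlinks(outlinks, 2));  // [a, b]
        System.out.println(selectOutlinks(outlinks, 0));  // []
    }
}
```

Clamping the size before constructing the list keeps the documented meaning of -1 without needing the "very large value" workaround.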

> Jeff
>
>
> -----Original Message-----
> From: Kai_testing Middleton [mailto:kai_testingyahoo.com]
> Sent: Thursday, August 09, 2007 5:33 PM
> To: nutch user
> Subject: nutch nightly: IllegalArgumentException: Illegal Capacity: -1
>
> I can't seem to get the nightly build to work!  It looks like an error
> that I was getting under cygwin is also haunting me under BSD.  Am I
> doing something very wrong?  I have tried this from scratch about two
> or three times now and I still get this error in my hadoop.log:
>              Illegal Capacity: -1
>
>
> In these posts I was trying the nightly build with cygwin:
>
> http://www.mail-archive.com/nutch-userlucene.apache.org/msg08955.html
>
> http://www.mail-archive.com/nutch-userlucene.apache.org/msg08950.html
>
> Now, I have installed a nutch nightly under BSD as follows:
>
> $ cd /usr/tmp2
> $ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk -r
>
> $ mv trunk nutch-trunk
> $ cd nutch-trunk
> $ ant clean
> $ ant -verbose
> set NUTCH_HOME to /usr/tmp2/nutch_trunk
> modify conf/nutch-site.xml
> modify conf/crawl-urlfilter.txt
> modify conf/log4j.properties
>
> Just to be sure, I also ran this in /usr/tmp2/nutch-trunk:
> $ svn up -r HEAD
> $ ant clean
> $ ant
>
> I am unable to do an "intranet" style crawl.  Here's what it looks like
> on the console:
>
> $ bin/nutch crawl /usr/tmp2/urls.txt -dir /usr/tmp2/100sites -depth 4 -topN 5
> crawl started in: /usr/tmp2/100sites
> rootUrlDir = /usr/tmp2/urls.txt
> threads = 10
> depth = 4
> topN = 5
> Injector: starting
> Injector: crawlDb: /usr/tmp2/100sites/crawldb
> Injector: urlDir: /usr/tmp2/urls.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: /usr/tmp2/100sites/segments/20070809141119
> Generator: filtering: false
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: /usr/tmp2/100sites/segments/20070809141119
> Fetcher: threads: 10
> fetching http://ec.europa.eu/grants/index_en.htm
> fetching http://ec.europa.eu/information_society/media/index_en.htm
> fetching http://filmfinancing.org/
> fetching http://dedo.delaware.gov/filmoffice/default.shtml
> fetching http://filmnanaimo.com/
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:499)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
>
> Here's hadoop.log:
>
> 2007-08-09 13:56:17,661 INFO  plugin.PluginRepository - Plugins: looking in: /usr/tmp2/nutch-trunk/build/plugins
> 2007-08-09 13:56:17,797 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2007-08-09 13:56:17,797 INFO  plugin.PluginRepository - Registered Plugins:
> 2007-08-09 13:56:17,797 INFO  plugin.PluginRepository -     CyberNeko HTML Parser (lib-nekohtml)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Site Query Filter (query-site)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Basic URL Normalizer (urlnormalizer-basic)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Html Parse Plug-in (parse-html)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Pass-through URL Normalizer (urlnormalizer-pass)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Regex URL Filter Framework (lib-regex-filter)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Feed Parse/Index/Query Plug-in (feed)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Basic Indexing Filter (index-basic)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Basic Summarizer Plug-in (summary-basic)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Text Parse Plug-in (parse-text)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     JavaScript Parser (parse-js)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Basic Query Filter (query-basic)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Regex URL Filter (urlfilter-regex)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     HTTP Framework (lib-http)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     XML Libraries (lib-xml)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     URL Query Filter (query-url)
> 2007-08-09 13:56:17,798 INFO  plugin.PluginRepository -     Regex URL Normalizer (urlnormalizer-regex)
> 2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -     Http Protocol Plug-in (protocol-http)
> 2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -     the nutch core extension points (nutch-extensionpoints)
> 2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -     OPIC Scoring Plug-in (scoring-opic)
> 2007-08-09 13:56:17,799 INFO  plugin.PluginRepository - Registered Extension-Points:
> 2007-08-09 13:56:17,799 INFO  plugin.PluginRepository -     Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Ontology Model Loader (org.apache.nutch.ontology.Ontology)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-08-09 13:56:17,800 INFO  plugin.PluginRepository -     Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-08-09 13:56:18,911 WARN  mapred.LocalJobRunner - job_blv6jf
> java.lang.IllegalArgumentException: Illegal Capacity: -1
>     at java.util.ArrayList.<init>(ArrayList.java:111)
>     at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:149)
>     at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:94)
>     at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:311)
>     at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)


-- 
Doğacan Güney
