[ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Whitman updated NUTCH-554:
--------------------------------
Attachment: genpatch.diff
Attaching a patch that seems to fix the problem for me. It
just catches the MalformedURLException in the generator. I
don't know why this exception would cause a fatal error in
Nutch, but it did.
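For readers without the attachment, here is a minimal sketch of the
general idea (catch the exception per URL and skip that entry instead
of letting it fail the whole job). This is not the contents of
genpatch.diff; the class SkipMalformedUrls and the method hostsOf are
hypothetical names used only for illustration, and the real change
would sit in the generator's reduce step rather than in a standalone
class.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the attached patch.
public class SkipMalformedUrls {

    // Returns the host of each URL that java.net.URL can parse.
    // Entries it rejects (for example, ones with no protocol) are
    // logged to stderr and skipped rather than propagated.
    public static List<String> hostsOf(List<String> urls) {
        List<String> hosts = new ArrayList<>();
        for (String s : urls) {
            try {
                hosts.add(new URL(s).getHost());
            } catch (MalformedURLException e) {
                System.err.println("Skipping malformed URL: " + s
                        + " (" + e.getMessage() + ")");
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Same two entries as test/urls.txt in the report.
        List<String> seeds = Arrays.asList("www.variogr.am",
                "http://www.variogr.am/");
        System.out.println(hostsOf(seeds));
    }
}
```

With the two entries from the report, only the second one is parsed;
the first is skipped with a message instead of aborting the run.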
> Generator throws java.io.IOException and dies on injected urls with no protocol
> --------------------------------------------------------------------------------
>
> Key: NUTCH-554
> URL: https://issues.apache.org/jira/browse/NUTCH-554
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 1.0.0
> Environment: Linux(debian) Java 1.6
> Reporter: Brian Whitman
> Attachments: genpatch.diff
>
>
> On trunk nutch, injecting URLs with no protocol (like
> issues.apache.org/jira/ vs. https://issues.apache.org/jira/)
> causes the generator to fail with an IOException:
> java.net.MalformedURLException: no protocol: www.variogr.am
>     at java.net.URL.<init>(URL.java:567)
>     at java.net.URL.<init>(URL.java:464)
>     at java.net.URL.<init>(URL.java:413)
>     at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:187)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
> 2007-09-15 11:11:26,986 FATAL crawl.Generator - Generator: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>     at org.apache.nutch.crawl.Generator.generate(Generator.java:416)
>     at org.apache.nutch.crawl.Generator.run(Generator.java:557)
>     at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>     at org.apache.nutch.crawl.Generator.main(Generator.java:520)
> To test:
> # cat test/urls.txt
> www.variogr.am
> http://www.variogr.am/
> # bin/nutch inject testcrawl/crawldb test/
> (this goes fine)
> # bin/nutch generate testcrawl/crawldb testcrawl/segments -topN 10
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: testcrawl/segments/20070915111125
> Generator: filtering: true
> Generator: topN: 10
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: java.io.IOException: Job failed!
>
> This issue did not exist in earlier versions of nutch --
> it would ignore the malformed URL without crashing.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.