Thread: Updated: (NUTCH-554) Generator throws java.io.IOException and dies on injected urls with no p




2007-09-17 13:20:43
     [ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Whitman updated NUTCH-554:
--------------------------------

    Attachment: genpatch.diff

Attaching a patch that seems to fix the problem for me. It
just catches the MalformedURLException in the generator. I
don't know why this exception would cause a fatal error in
Nutch, but it did.
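
For readers without the attachment, here is a rough sketch of the general idea (catch the MalformedURLException and skip the entry instead of letting it abort the job). This is not the contents of genpatch.diff and not Nutch code; the class name SkipMalformedUrls and the method selectValid are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical names, not taken from genpatch.diff or Nutch.
class SkipMalformedUrls {

    // Returns the entries that java.net.URL can parse; unparseable ones are skipped.
    @SuppressWarnings("deprecation") // new URL(String) is deprecated on recent JDKs
    static List<String> selectValid(List<String> candidates) {
        List<String> valid = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                new URL(candidate); // throws MalformedURLException when no protocol is given
                valid.add(candidate);
            } catch (MalformedURLException e) {
                // Log and move on instead of propagating the exception.
                System.err.println("Skipping malformed URL: " + e.getMessage());
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // Same two entries as test/urls.txt in the report.
        List<String> seeds = Arrays.asList("www.variogr.am", "http://www.variogr.am/");
        System.out.println(selectValid(seeds));
    }
}
```

With the two seed entries from the report, only the entry that carries a protocol should survive the filter.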



> Generator throws java.io.IOException and dies on injected urls with no protocol
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-554
>                 URL: https://issues.apache.org/jira/browse/NUTCH-554
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.0.0
>         Environment: Linux(debian) Java 1.6
>            Reporter: Brian Whitman
>         Attachments: genpatch.diff
>
>
> On trunk nutch, injecting URLs with no protocol (like issues.apache.org/jira/ vs. https://issues.apache.org/jira/) causes the generator to fail with an IOException:
> java.net.MalformedURLException: no protocol: www.variogr.am
>         at java.net.URL.<init>(URL.java:567)
>         at java.net.URL.<init>(URL.java:464)
>         at java.net.URL.<init>(URL.java:413)
>         at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:187)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
> 2007-09-15 11:11:26,986 FATAL crawl.Generator - Generator: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:416)
>         at org.apache.nutch.crawl.Generator.run(Generator.java:557)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.crawl.Generator.main(Generator.java:520)
> To test:
> # cat test/urls.txt
> www.variogr.am
> http://www.variogr.am/
> # bin/nutch inject testcrawl/crawldb test/
> (this goes fine)
> # bin/nutch generate testcrawl/crawldb testcrawl/segments -topN 10
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: testcrawl/segments/20070915111125
> Generator: filtering: true
> Generator: topN: 10
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: java.io.IOException: Job failed!
>  
> This issue did not exist in earlier versions of nutch -- it would ignore the malformed URL without crashing.
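
The root exception in the trace above can be reproduced outside Nutch with plain java.net.URL. The following is a small standalone sketch, not Nutch code; the class name NoProtocolCheck and the method describe are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative only: shows how java.net.URL treats the two entries from test/urls.txt.
class NoProtocolCheck {

    // Reports whether java.net.URL accepts the given string.
    @SuppressWarnings("deprecation") // new URL(String) is deprecated on recent JDKs
    static String describe(String spec) {
        try {
            URL url = new URL(spec);
            return "ok: protocol=" + url.getProtocol();
        } catch (MalformedURLException e) {
            // For a string without a scheme, the message starts with "no protocol".
            return "rejected: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("www.variogr.am"));
        System.out.println(describe("http://www.variogr.am/"));
    }
}
```

The first entry should be rejected with the same "no protocol" message seen in the stack trace, while the second parses normally.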

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

