
Thread: Closed: (NUTCH-274) Empty row in/at end of URL-list results in error




2006-12-28 00:22:23
     [ http://issues.apache.org/jira/browse/NUTCH-274?page=all ]

Andrzej Bialecki  closed NUTCH-274.
-----------------------------------

    Fix Version/s: 0.8.2
                   0.9.0
       Resolution: Fixed
         Assignee: Andrzej Bialecki 

This bug has been fixed in recent versions of Hadoop.

> Empty row in/at end of URL-list results in error
> ------------------------------------------------
>
>                 Key: NUTCH-274
>                 URL: http://issues.apache.org/jira/browse/NUTCH-274
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: nightly-2006-05-20
>            Reporter: Stefan Neufeind
>         Assigned To: Andrzej Bialecki 
>            Priority: Minor
>             Fix For: 0.8.2, 0.9.0
>
>         Attachments: ignoreEmpthyLineDuringInjectV1.patch
>
>
> This is minor - but it's a little unclean.
> Reproduce: Have a URL-file with one URL followed by a newline, thus producing an empty line.
> Outcome: Fetcher-threads try to fetch two URLs at the same time. First one is fine - but second is empty and therefore fails proper protocol-detection.
> 60521 022639   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> 060521 022639   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> 060521 022639 found resource parse-plugins.xml at file:/home/mm/nutch-nightly/conf/parse-plugins.xml
> 060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
> 060521 022639 fetching http://www.bild.de/
> 060521 022639 fetching 
> 060521 022639 fetch of  failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: no protocol: 
> 060521 022639 http.proxy.host = null
> 060521 022639 http.proxy.port = 8080
> 060521 022639 http.timeout = 10000
> 060521 022639 http.content.limit = 65536
> 060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agentlucene.apache.org)
> 060521 022639 fetcher.server.delay = 1000
> 060521 022639 http.max.delays = 1000
> 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
> 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
> 060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
> 060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060521 022640  map 0%  reduce 0%
> 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 
> 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 
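
For illustration only: the failure reported above comes from an empty row in the URL list being passed on as a URL. A minimal sketch of guarding against that when reading a seed list might look like the following. The class and method names (SeedListFilter, filterSeedLines) are hypothetical and are not taken from Nutch, Hadoop, or the attached patch; this is not a description of how the actual fix works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Nutch or Hadoop.
public class SeedListFilter {

    /** Returns the trimmed, non-empty lines of a seed list, in their original order. */
    public static List<String> filterSeedLines(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // skip empty rows instead of handing "" on as a URL
            }
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        // One URL followed by an empty row, as in the report above.
        List<String> lines = Arrays.asList("http://www.bild.de/", "");
        System.out.println(filterSeedLines(lines)); // prints [http://www.bild.de/]
    }
}
```

With a filter like this in front of the injector, only the one real URL would reach the fetcher, so no "no protocol" error would be raised for the empty row.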

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa

-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        