Hi
I played around with the configuration files by commenting out one
thing at a time, and I found the problem with my nutch-site.xml. I had
the following:
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be
truncated; otherwise, no truncation at all.
</description>
</property>
And this caused it to fetch NOTHING! I changed that -1 to
<value>1073741824</value>
and it now works! I am using nutch-0.8.1 but plan to switch to
nutch-0.9 soon.
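
For reference, a minimal sketch of the corrected entry, assuming the
rest of nutch-site.xml stays unchanged. The value 1073741824 is 1 GiB
(2^30 bytes); the description text here is paraphrased, not copied from
the shipped defaults.

```xml
<!-- Sketch only: same property as above, with the -1 replaced by an
     explicit, very large byte limit. -->
<property>
<name>http.content.limit</name>
<value>1073741824</value>
<description>The length limit for downloaded content, in bytes.
Content longer than this value will be truncated.
</description>
</property>
```
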
Thanks for your help.
Nancy
Dennis Kubes wrote:
> Can you post your regex-urlfilter file? Google news seems to block
> most things at the robots.txt level but nytimes doesn't.
>
> Dennis Kubes
>
> Nancy Snyder wrote:
>
>> Hi
>> Yes I have those filled in with values in my nutch-site.xml.
>> Nancy
>>
>> Dennis Kubes wrote:
>>
>>> Have you set up your user agent in the nutch-site.xml file? The
>>> http.agent.name configuration variable is required and it is best to
>>> set http.agent.description, http.agent.url, http.agent.email, and
>>> http.agent.version as well.
>>>
>>> Dennis Kubes
>>>
>>> Nancy Snyder wrote:
>>>
>>>> Hi
>>>>
>>>> I am trying to crawl some sites
>>>>
>>>> http://www.nytimes.com/
>>>> http://news.google.com/
>>>>
>>>> I am able to successfully crawl other sites. And I have the
>>>> regex-urlfilter.txt accepting http and most file suffixes and
>>>> accepting those domains that I am crawling.
>>>>
>>>> But there is no parsed text on the initial url. So it doesn't
>>>> fetch anything??
>>>>
>>>> Any clues on what I am doing wrong??
>>>>
>>>> Nancy
>>>
>>>
>>