Thread: Fetching nothing on certain sites ??




Fetching nothing on certain sites ??
2007-10-08 09:17:59
Hi,

I am trying to crawl some sites:

    http://www.nytimes.com/
    http://news.google.com/

I am able to crawl other sites successfully. My regex-urlfilter.txt accepts http, most file suffixes, and the domains that I am crawling.

But there is no parsed text for the initial URL, so it doesn't fetch anything.

Any clues as to what I am doing wrong?

Nancy

Re: Fetching nothing on certain sites ??
2007-10-08 09:50:43
Have you set up your user agent in the nutch-site.xml file? The http.agent.name configuration variable is required, and it is best to set http.agent.description, http.agent.url, http.agent.email, and http.agent.version as well.
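
As a quick way to check this, here is a minimal Python sketch that reads a nutch-site.xml style document and reports which of the http.agent.* properties named above are missing or empty. The sample XML, its values (MyCrawler, example.com, and so on), and the function names are all placeholders made up for illustration; they are not taken from anyone's real configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml content; every value is a placeholder.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler@example.com</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>0.1</value>
  </property>
</configuration>
"""

# The properties mentioned in this thread; http.agent.name is the required one.
AGENT_PROPERTIES = [
    "http.agent.name",
    "http.agent.description",
    "http.agent.url",
    "http.agent.email",
    "http.agent.version",
]


def read_properties(xml_text):
    """Return a dict of property name -> value from a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name", default="") or "").strip()
        value = (prop.findtext("value", default="") or "").strip()
        if name:
            props[name] = value
    return props


def missing_agent_properties(xml_text):
    """List the agent properties that are absent or have an empty value."""
    props = read_properties(xml_text)
    return [name for name in AGENT_PROPERTIES if not props.get(name)]


if __name__ == "__main__":
    print(missing_agent_properties(SAMPLE_SITE_XML))
```

To check a real file, read its text and pass it to missing_agent_properties; an empty list means all five properties have values.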

Dennis Kubes


Re: Fetching nothing on certain sites ??
2007-10-08 10:07:24
Hi,
Yes, I have those filled in with values in my nutch-site.xml.
Nancy



Re: Fetching nothing on certain sites ??
2007-10-08 10:28:51
Can you post your regex-urlfilter file? Google News seems to block most things at the robots.txt level, but nytimes doesn't.
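
To see what a robots.txt file permits for a given crawler, Python's standard urllib.robotparser can evaluate the rules. The sketch below is only an illustration: the robots.txt content, the agent name MyCrawler, and the example.com URLs are made up and are not copies of any real site's rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /news
"""


def is_allowed(robots_txt, agent, url):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/news/story", "http://example.com/about"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler", url))
```

For a live site, the same parser can load the file itself with set_url() followed by read() before calling can_fetch().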

Dennis Kubes


Re: Fetching nothing on certain sites ??
2007-10-08 15:10:08
Hi

I played around with the configuration files by commenting out one thing at a time, and I found the problem in my nutch-site.xml. I had the following:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

And this caused it to fetch NOTHING! I changed that -1 to

<value>1073741824</value>

and it now works! I am using nutch-0.8.1 but plan to switch to nutch-0.9 soon.
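
Read literally, the property description quoted above amounts to the rule sketched below. This is only an illustration of the documented semantics, with a hypothetical function name; it is not Nutch's code, and it does not explain why -1 fetched nothing in this setup.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Apply the rule stated in the http.content.limit description.

    A nonnegative limit truncates content longer than the limit;
    a negative limit means no truncation at all.
    """
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


if __name__ == "__main__":
    page = b"<html>hello</html>"
    print(len(apply_content_limit(page, -1)))          # documented: untouched
    print(len(apply_content_limit(page, 6)))           # truncated to 6 bytes
    print(len(apply_content_limit(page, 1073741824)))  # 1 GiB limit: untouched
```

The replacement value 1073741824 is 1024 ** 3 bytes (1 GiB), large enough that ordinary pages are never truncated.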

Thanks for your help.  

Nancy



