Thread: Created: (NUTCH-419) unavailable robots.txt kills fetch




Created: (NUTCH-419) unavailable robots.txt kills fetch
user name
2006-12-24 12:45:21
unavailable robots.txt kills fetch
----------------------------------

                 Key: NUTCH-419
                 URL: http://issues.apache.org/jira/browse/NUTCH-419
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8.1
         Environment: Fetcher is behind a squid proxy, but I am pretty sure this is irrelevant.
Nutch in local mode, running on a Linux machine with 2GB RAM.
            Reporter: Carsten Lehmann


I think there is another robots.txt-related problem which is not
addressed by NUTCH-344, but which also results in an aborted fetch.

I am sure that in my last fetch all 17 fetcher threads died while they
were waiting for a robots.txt file to be delivered by a web server
that was not responding properly.

I looked at the squid access log, which is used by all fetch threads.
It ends with many HTTP 504 errors ("gateway timeout") caused by a
certain robots.txt URL:

<....>
1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html

These entries mean that it takes 15 minutes before the request ends
with a timeout. This can be calculated from the squid log: the first
column is the request time (in UTC seconds), the second column is the
duration of the request (in ms). With durations of roughly 900000 ms:
900000/1000/60 = 15 minutes.
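
The arithmetic above can be reproduced directly from the log lines. The
following is a minimal sketch, assuming the column layout described above
(timestamp in the first column, elapsed milliseconds in the second). The
function name elapsed_minutes is made up for illustration and is not part
of Nutch or squid.

```python
def elapsed_minutes(log_line: str) -> float:
    """Return the request duration in minutes from one squid access log line."""
    fields = log_line.split()
    elapsed_ms = int(fields[1])  # second column: request duration in ms
    return elapsed_ms / 1000 / 60


# The three entries quoted in the report.
lines = [
    "1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
    "1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
    "1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
]

durations = [elapsed_minutes(line) for line in lines]
for minutes in durations:
    print(f"{minutes:.2f} minutes")
```

Each of the three requests comes out just under 15 minutes, which matches
the rounded figure used above.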

As far as I understand it, every time a fetch thread tries to get this
robots.txt file, the thread busy-waits for the duration of the request
(15 minutes). If this is right, then all 17 fetcher threads were caught
in this trap at the time when fetching was aborted, as there are 17
requests in the squid log which did not time out before the message
"aborting with 17 threads" was written to the Nutch log file.
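
The "still waiting when the fetch was aborted" reasoning can be checked
mechanically. This is a rough sketch under two assumptions that are not
confirmed in this report: that squid writes the timestamp when a request
completes (so the request started at timestamp minus elapsed time), and
that the abort time is known. The abort_time value below is hypothetical
and chosen only for illustration; the real one would have to be taken
from hadoop.log. The function name in_flight is also illustrative.

```python
def in_flight(log_lines, at_time):
    """Return the log lines whose request was still open at `at_time` (epoch seconds)."""
    open_requests = []
    for line in log_lines:
        fields = line.split()
        end = float(fields[0])                 # assumed: time the request finished
        start = end - int(fields[1]) / 1000.0  # elapsed time is in milliseconds
        if start <= at_time < end:
            open_requests.append(line)
    return open_requests


# The three entries quoted in the report.
lines = [
    "1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
    "1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
    "1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
]

abort_time = 1166652000.0  # hypothetical abort time, not taken from the logs
pending = in_flight(lines, abort_time)
print(len(pending), "requests still waiting at the assumed abort time")
```

Run against the full squid log with the real abort time, a count of 17
would support the explanation given above.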

Setting fetcher.max.crawl.delay cannot help here. I see 296 access
attempts in total for this robots.txt URL in the squid log of this
crawl, but fetcher.max.crawl.delay is set to 30.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa

-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
Updated: (NUTCH-419) unavailable robots.txt kills fetch
user name
2006-12-24 13:01:26
     [ http://issues.apache.org/jira/browse/NUTCH-419?page=all ]

Carsten Lehmann updated NUTCH-419:
----------------------------------

    Attachment: nutch-log.txt

Log extract from hadoop.log. Two points are interesting:

a) There are no entries in the log file between 22:51 and 23:02; at
23:02 the fetch is aborted.
b) After the fetch is aborted, the stack traces show different URLs
(not http://XYZ.gso.gbv.de), although that is what seems to have been
fetched, according to the last requests in the squid log (see other
attachment).

        
Updated: (NUTCH-419) unavailable robots.txt kills fetch
user name
2006-12-24 13:10:22
     [ http://issues.apache.org/jira/browse/NUTCH-419?page=all ]

Carsten Lehmann updated NUTCH-419:
----------------------------------

    Attachment: last_robots.txt_requests_squidlog.txt

        
Updated: (NUTCH-419) unavailable robots.txt kills fetch
user name
2006-12-24 13:10:24
     [ http://issues.apache.org/jira/browse/NUTCH-419?page=all ]

Carsten Lehmann updated NUTCH-419:
----------------------------------

    Attachment: squid_access_log_tail1000.txt

        
Commented: (NUTCH-419) unavailable robots.txt kills fetch
user name
2006-12-24 13:26:22
    [ http://issues.apache.org/jira/browse/NUTCH-419?page=comments#action_12460696 ]
            
Carsten Lehmann commented on NUTCH-419:
---------------------------------------

Some more explanations:

Above I meant http://gso.gbv.de/XYZ, not http://XYZ.gso.gbv.de, of
course.


I have attached two other log extracts:

a) squid_access_log_tail1000.txt

This file contains the last 1000 lines of the squid access log. It
shows what the fetcher had actually been doing before the fetch was
aborted. It ends with a number of requests to that robots.txt URL.

b) last_robots.txt_requests_squidlog.txt

This file shows the last requests to that robots.txt URL.

It might be of concern that near the end of this file the line
1166652145.652 1042451 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
repeats 14 times. This means that there have been 14 simultaneous
requests to this URL, right?
Are requests to the robots.txt file not covered by
"fetcher.server.delay", which is set to "2.0" in my configuration?
Anyway, this seems to be ill behaviour.
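
Repeated entries like the one above can be found by counting identical
lines. This is a small sketch using only the Python standard library. The
sample data is the quoted line repeated 14 times as described, not the
actual attachment, and repeated_entries is an illustrative name.

```python
from collections import Counter


def repeated_entries(log_lines, min_count=2):
    """Return {line: count} for lines that occur at least `min_count` times."""
    counts = Counter(line.strip() for line in log_lines)
    return {line: n for line, n in counts.items() if n >= min_count}


entry = (
    "1166652145.652 1042451 127.0.0.1 TCP_MISS/504 1450 GET "
    "http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html"
)
sample = [entry] * 14  # stand-in for the tail of the attached log

for line, n in repeated_entries(sample).items():
    print(n, line)
```

Identical timestamps and durations suggest that the requests ended
together, but on their own they do not prove that 14 separate fetcher
threads issued them at the same moment.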

        