Thread: Created: (NUTCH-516) Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE
|
|
| Created: (NUTCH-516) Next fetch time is
not set when it is a
CrawlDatum.STATUS_FETCH_GONE |
  United States |
2007-07-17 07:08:23 |
Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE
--------------------------------------------------------------------
Key: NUTCH-516
URL: https://issues.apache.org/jira/browse/NUTCH-516
Project: Nutch
Issue Type: Bug
Components: fetcher
Environment: Java 1.6, Linux 2.6
Reporter: Emmanuel Joke
Fix For: 1.0.0
We cannot crawl some pages because of a robots restriction. In
this case we update the db with the metadata
_pst_:robots_denied(18), set the status code to 3, and
change the fetch interval to 67.5 days.
Unfortunately, the fetch time is never changed, so the page
keeps being generated and fetched on every run.
We should update the fetch schedule in the crawldb to reflect
the fetch interval.
We should add in CrawlDbReducer:
case CrawlDatum.STATUS_FETCH_GONE: // permanent failure
  if (old != null)
    result.setSignature(old.getSignature()); // use old signature
  result.setStatus(CrawlDatum.STATUS_DB_GONE);
  result = schedule.setPageGoneSchedule((Text)key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime());
  // set the schedule
  result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime(),
      fetch.getModifiedTime(), modified);
  break;
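To illustrate the symptom reported above, here is a minimal sketch of why a page whose fetch time never moves forward is picked up on every run. It assumes that a page is selected for fetching once its scheduled fetch time has passed; the class and method names (DueCheckSketch, isDue) are hypothetical and are not Nutch APIs.

```java
// Hypothetical illustration, not Nutch code.
final class DueCheckSketch {
    // Assumption: a page is due when its scheduled fetch time is not in the future.
    static boolean isDue(long fetchTimeMillis, long nowMillis) {
        return fetchTimeMillis <= nowMillis;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long staleFetchTime = 0L; // never moved forward after the robots denial
        for (int run = 1; run <= 3; run++) {
            long now = run * day;
            // The stale fetch time stays in the past, so the page is due on every run.
            System.out.println("run " + run + " due=" + isDue(staleFetchTime, now));
        }
    }
}
```

However long the fetch interval is, it has no effect on this check unless the fetch time itself is also pushed into the future.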
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (NUTCH-516) Next fetch time
is not set when it is a
CrawlDatum.STATUS_FETCH_GONE |
  United States |
2007-07-17 08:53:05 |
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513249 ]
Doğacan Güney commented on NUTCH-516:
-------------------------------------
Actually schedule.setPageGoneSchedule updates the next fetch
time. The problem is that the default fetchInterval is tiny (30
seconds), so setting the next fetch time to fetchInterval * 1.5
(as setPageGoneSchedule does) pushes the next fetch time only 45
seconds later. I think if you retry with NUTCH-515 your
problem will be solved. Note: after applying NUTCH-515, you
have to re-inject and run parse and updatedb-s for all
segments again (but you don't need to refetch).
|
|
| Updated: (NUTCH-516) Next fetch time is
not set when it is a
CrawlDatum.STATUS_FETCH_GONE |
  United States |
2007-07-17 09:19:04 |
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney updated NUTCH-516:
--------------------------------
Comment: was deleted
|
|
| Commented: (NUTCH-516) Next fetch time
is not set when it is a
CrawlDatum.STATUS_FETCH_GONE |
  United States |
2007-07-17 09:28:04 |
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513255 ]
Doğacan Güney commented on NUTCH-516:
-------------------------------------
My last (deleted) comment is completely wrong. You are right
that the fetch time is not updated.
I think it would be better to update setPageGoneSchedule to
set the next fetch time instead of calling setFetchSchedule in
CrawlDbReducer.
Can you attach your change as a patch?
|
|
| Commented: (NUTCH-516) Next fetch time
is not set when it is a
CrawlDatum.STATUS_FETCH_GONE |
  United States |
2007-07-17 09:43:05 |
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513258 ]
Andrzej Bialecki commented on NUTCH-516:
-----------------------------------------
setPageGoneSchedule method was specifically added to handle
this case. The bug is in this method (in the Schedule
implementation) and it should be fixed there, instead of
calling setFetchSchedule.
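As a rough sketch of what fixing this inside the schedule method could look like, the example below lengthens the interval and also moves the next fetch time forward. It is not the committed patch: DatumSketch is a hypothetical stand-in for CrawlDatum, the interval is assumed to be held in seconds, and the 1.5 factor is taken from the earlier comment in this thread.

```java
// Hypothetical stand-in for CrawlDatum; only the two fields the sketch needs.
final class DatumSketch {
    long fetchTimeMillis;      // next time the page is due
    int fetchIntervalSeconds;  // re-fetch interval (assumed to be in seconds)
}

final class PageGoneScheduleSketch {
    // Lengthen the interval by 1.5 (factor assumed from the thread) and,
    // crucially, push the next fetch time forward by the new interval.
    static DatumSketch setPageGoneSchedule(DatumSketch datum, long fetchTimeMillis) {
        datum.fetchIntervalSeconds = (int) Math.round(datum.fetchIntervalSeconds * 1.5d);
        datum.fetchTimeMillis = fetchTimeMillis + datum.fetchIntervalSeconds * 1000L;
        return datum;
    }

    public static void main(String[] args) {
        DatumSketch datum = new DatumSketch();
        datum.fetchIntervalSeconds = 30 * 24 * 60 * 60; // 30 days
        setPageGoneSchedule(datum, 0L);
        System.out.println(datum.fetchIntervalSeconds); // 3888000 (45 days)
        System.out.println(datum.fetchTimeMillis);      // 3888000000
    }
}
```

Doing the millisecond arithmetic with a long literal (1000L) keeps the result out of int range problems for long intervals.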
|
|
| Updated: (NUTCH-516) Next fetch time is
not set when it is a
CrawlDatum.STATUS_FETCH_GONE |
  United States |
2007-07-18 01:35:04 |
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Emmanuel Joke updated NUTCH-516:
--------------------------------
Attachment: NUTCH-516.patch
I fixed the issue by changing the fetch time in
AbstractFetchSchedule.setPageGoneSchedule.
I also noticed that the fetch time was not set correctly. It
was always about 25 days later, not the 30 days defined in
nutch-default.xml. The expression
Math.round(datum.getFetchInterval() * 1000.0f) always
returned the same value, 2147483647, which is
Integer.MAX_VALUE: with a float argument, Math.round returns an
int rather than a long. To force a long result I changed
1000.0f to 1000.0d.
Now the fetch time will be set correctly according to the
fetch interval.
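The rounding problem described above can be reproduced in isolation. The sketch below assumes the interval is an int number of seconds; the class and method names are hypothetical. It relies only on standard Java behavior: Math.round(float) returns an int and saturates at Integer.MAX_VALUE, while Math.round(double) returns a long.

```java
// Hypothetical illustration of the float/double rounding difference.
final class FetchIntervalRounding {
    // int * float is a float, so Math.round(float) is chosen and returns an int.
    // Values above the int range are clamped to Integer.MAX_VALUE.
    static long millisWithFloat(int intervalSeconds) {
        return Math.round(intervalSeconds * 1000.0f);
    }

    // int * double is a double, so Math.round(double) is chosen and returns a long.
    static long millisWithDouble(int intervalSeconds) {
        return Math.round(intervalSeconds * 1000.0d);
    }

    public static void main(String[] args) {
        int thirtyDays = 30 * 24 * 60 * 60; // 2,592,000 seconds
        System.out.println(millisWithFloat(thirtyDays));  // 2147483647, about 24.9 days
        System.out.println(millisWithDouble(thirtyDays)); // 2592000000, exactly 30 days
    }
}
```

2147483647 milliseconds is roughly 24.9 days, which matches the "always 25 days later" observation.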
|
|
| Commented: (NUTCH-516) Next fetch time
is not set when it is a
CrawlDatum.STATUS_FETCH_GONE |
  United States |
2007-07-18 03:16:04 |
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513473 ]
Doğacan Güney commented on NUTCH-516:
-------------------------------------
Nice catch. I totally missed the int/long casting bug.
+1
|
|
| Resolved: (NUTCH-516) Next fetch time
is not set when it is a
CrawlDatum.STATUS_FETCH_GONE |
  United States |
2007-07-26 03:37:32 |
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney resolved NUTCH-516.
---------------------------------
Resolution: Fixed
Patch committed in rev. 559742.
|
|
| Closed: (NUTCH-516) Next fetch time is
not set when it is a
CrawlDatum.STATUS_FETCH_GONE |
  United States |
2007-07-26 07:55:41 |
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney closed NUTCH-516.
-------------------------------
Resolved and committed.
|
|
| Commented: (NUTCH-516) Next fetch time
is not set when it is a
CrawlDatum.STATUS_FETCH_GONE |
  United States |
2007-07-26 23:25:04 |
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515954 ]
Hudson commented on NUTCH-516:
------------------------------
Integrated in Nutch-Nightly #162 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/162/])
|
|
|
|