[ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ]
Doug Cook commented on NUTCH-416:
---------------------------------
You may also want to make the status codes ORed values, so
that, for example, the various kinds of failure all
have a FAILED code ORed in, making it clean & easy in
the code to check for "any failure case" while
still allowing different failure codes. So at the lowest
level, the values might be things like FAILED, FETCHED, and
UNFETCHED, while REDIRECT might be (FETCHED | something),
specific redirect codes would be (REDIRECT | something),
and specific failure codes would be (FAILED | something).
This way we can keep all of the specific failure codes, all
the specific redirect codes, etc. while making the code
cleaner and more reliable. We won't have to worry about
keeping range checks or switch statements in sync if we add
new codes; a statement like
if ((code & FAILED) != 0) {
}
will always tell us whether a URL fetch failed, regardless
of how many codes we add. The way the code currently is,
adding status codes is likely to break things if one is not
careful to go through every place where status codes are
examined to ensure that the new code is properly accounted
for.
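As a rough illustration of the scheme above, here is a minimal Java sketch. The class name StatusFlags, the specific codes (REDIRECT_TEMP, FAILED_TIMEOUT, and so on) and all bit values are hypothetical, chosen only for this example; they are not the constants currently in CrawlDatum. Note that REDIRECT is a multi-bit mask here, so testing for it compares against the whole mask rather than against zero.

```java
// Hypothetical sketch: names and bit values are illustrative only,
// not the actual CrawlDatum status constants.
final class StatusFlags {
    // Lowest-level categories: one bit each.
    static final int UNFETCHED = 0x01;
    static final int FETCHED   = 0x02;
    static final int FAILED    = 0x04;

    // A redirect is a kind of fetch, so it carries the FETCHED bit plus its own.
    static final int REDIRECT = FETCHED | 0x10;

    // Specific codes OR a distinguishing bit into their category (assumed examples).
    static final int REDIRECT_TEMP    = REDIRECT | 0x100;
    static final int REDIRECT_PERM    = REDIRECT | 0x200;
    static final int FAILED_TIMEOUT   = FAILED | 0x100;
    static final int FAILED_NOT_FOUND = FAILED | 0x200;

    // Single-bit category: any code with the FAILED bit set is a failure.
    static boolean isFailure(int code) {
        return (code & FAILED) != 0;
    }

    static boolean isFetched(int code) {
        return (code & FETCHED) != 0;
    }

    // Multi-bit category: all bits of the REDIRECT mask must be present.
    static boolean isRedirect(int code) {
        return (code & REDIRECT) == REDIRECT;
    }

    public static void main(String[] args) {
        System.out.println("timeout is failure:   " + isFailure(FAILED_TIMEOUT));
        System.out.println("perm redirect fetched: " + isFetched(REDIRECT_PERM));
        System.out.println("plain fetch redirect:  " + isRedirect(FETCHED));
    }
}
```

Adding a new failure code in this sketch only means defining another (FAILED | something) constant; isFailure() does not change.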
While you're changing the CrawlDatum, it might also make
sense to store a second URL, e.g. that of the redirect
target. I have a hunch this will be very useful.
Just some thoughts. Thanks for making this happen.
Doug
> CrawlDatum status and CrawlDbReducer refactoring
> ------------------------------------------------
>
> Key: NUTCH-416
> URL: http://issues.apache.org/jira/browse/NUTCH-416
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Fix For: 0.9.0
>
>
> CrawlDatum needs more status codes, e.g. to reflect
> redirected pages. However, current values of status codes
> are linear, which prevents us from adding new codes in
> proper places. This is also related to the logic in
> CrawlDbReducer, which makes decisions based on arithmetic
> ordering of status code values.
> I propose to change the codes so that they are grouped
> into related values, with significant gaps between groups
> for adding new codes without causing significant reordering.
> I also propose to change the logic in CrawlDbReducer so that
> its operation is not so dependent on actual code values.
> A mapping should also be added between old and new
> codes to facilitate backward-compatibility of existing data.
> This mapping should be applied on the fly, without requiring
> explicit data conversion.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira