Thread: Created: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring




Created: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring
user name
2006-12-15 12:47:20
CrawlDatum status and CrawlDbReducer refactoring
------------------------------------------------

                 Key: NUTCH-416
                 URL: http://issues.apache.org/jira/browse/NUTCH-416
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Andrzej Bialecki 
         Assigned To: Andrzej Bialecki 
             Fix For: 0.9.0


CrawlDatum needs more status codes, e.g. to reflect
redirected pages. However, current values of status codes
are linear, which prevents us from adding new codes in
proper places. This is also related to the logic in
CrawlDbReducer, which makes decisions based on arithmetic
ordering of status code values.

I propose to change the codes so that they are grouped into
related values, with significant gaps between groups for
adding new codes without causing significant reordering. I
also propose to change the logic in CrawlDbReducer so that
its operation is not so dependent on actual code values.

A mapping should also be added between old and new codes to
facilitate backward-compatibility of existing data. This
mapping should be applied on the fly, without requiring
explicit data conversion.
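
As an illustration only, the on-the-fly mapping could be sketched as a lookup applied while a record is read. Everything here is an assumption made for the example: the class name StatusCompat, the idea that each record carries a version byte, and all old and new code values, which are placeholders rather than values taken from the Nutch source.

```java
/** Hypothetical sketch: translate legacy status codes while reading a record. */
final class StatusCompat {
    /** Assumed version marker of records written with the new codes. */
    static final byte CURRENT_VERSION = 2;

    /**
     * Index = legacy code, value = new code.
     * All values are placeholders chosen for the example.
     */
    private static final byte[] OLD_TO_NEW = {
        0x41, 0x01, 0x02, 0x03, 0x43, 0x21, 0x22, 0x25
    };

    /** Returns the status in the new scheme, converting legacy records. */
    static byte readStatus(byte recordVersion, byte rawStatus) {
        if (recordVersion >= CURRENT_VERSION) {
            return rawStatus; // already written with the new codes
        }
        if (rawStatus < 0 || rawStatus >= OLD_TO_NEW.length) {
            throw new IllegalArgumentException("Unknown legacy status: " + rawStatus);
        }
        return OLD_TO_NEW[rawStatus];
    }
}
```

Because the translation happens at read time, existing data stays untouched on disk and is rewritten in the new scheme only when it is next written out.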

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa

-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring
user name
2006-12-20 22:40:22
    [ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ] 
            
Doug Cook commented on NUTCH-416:
---------------------------------

You may also want to make the status codes ORed values, so
that, for example, every kind of failure has a FAILED code
ORed in, making it clean and easy in the code to check for
"any failure case" while still allowing different failure
codes. So at the lowest
levels, the values might be things like FAILED, FETCHED, and
UNFETCHED, while REDIRECT might be (FETCHED | something),
specific redirect codes would be (REDIRECT | something),
specific failure codes would be (FAILED | something), etc.
This way we can keep all of the specific failure codes, all
the specific redirect codes, etc. while making the code
cleaner and more reliable. We won't have to worry about
keeping range checks or switch statements in sync if we add
new codes; a statement like
   if ((code & FAILED) != 0) {
   }
will always tell us whether a URL fetch failed, regardless
of how many codes we add. The way the code currently is,
adding status codes is likely to break things if one is not
careful to go through every place where status codes are
examined to ensure that the new code is properly accounted
for.
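
A minimal sketch of the ORed layout described above, for illustration only. The class name StatusFlags, the specific codes, and all bit values are hypothetical and not taken from Nutch. Note that giving every specific code its own detail bit uses up bits quickly, so a real layout would need to budget them.

```java
/** Hypothetical bit-flag layout for status codes; values are illustrative. */
final class StatusFlags {
    // Base categories occupy the low bits.
    static final int UNFETCHED = 0x01;
    static final int FETCHED   = 0x02;
    static final int FAILED    = 0x04;

    // A redirect is treated as a kind of fetched result.
    static final int REDIRECT  = FETCHED | 0x08;

    // Specific codes OR one detail bit into their category.
    static final int REDIRECT_PERM  = REDIRECT | 0x10;
    static final int FAILED_TIMEOUT = FAILED | 0x20;
    static final int FAILED_GONE    = FAILED | 0x40;

    /** True for FAILED and for every code that has FAILED ORed in. */
    static boolean isFailed(int code) {
        // Parentheses are required: != binds tighter than & in Java.
        return (code & FAILED) != 0;
    }

    /** True only if all bits of REDIRECT are set, so plain FETCHED does not match. */
    static boolean isRedirect(int code) {
        return (code & REDIRECT) == REDIRECT;
    }
}
```

With this layout, adding another failure code means picking a free detail bit and ORing it with FAILED; isFailed() does not need to change.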

While you're changing the CrawlDatum, it might also make
sense to store a second URL, e.g. that of the redirect
target. I have a hunch this will be very useful.

Just some thoughts. Thanks for making this happen.

Doug



        
Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring
user name
2006-12-20 23:18:22
    [ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460091 ] 
            
Andrzej Bialecki commented on NUTCH-416:
----------------------------------------

There are two main distinct groups of status codes, but not
along the lines of success/failure - these are DB and Fetch
status codes. Additionally, the number of available bits for
a bitmask is very small, because the status needs to fit in
a byte.

My patch in progress contains the following now:

  public static final byte STATUS_DB_UNFETCHED      = 0x01;
  public static final byte STATUS_DB_FETCHED        = 0x02;
  public static final byte STATUS_DB_GONE           = 0x03;
  public static final byte STATUS_DB_REDIR_TEMP     = 0x04;
  public static final byte STATUS_DB_REDIR_PERM     = 0x05;
  
  /** Maximum value of DB-related status. */
  public static final byte STATUS_DB_MAX            = 0x1f;
  
  public static final byte STATUS_FETCH_SUCCESS     = 0x21;
  public static final byte STATUS_FETCH_RETRY       = 0x22;
  public static final byte STATUS_FETCH_REDIR_TEMP  = 0x23;
  public static final byte STATUS_FETCH_REDIR_PERM  = 0x24;
  public static final byte STATUS_FETCH_GONE        = 0x25;
  
  /** Maximum value of fetch-related status. */
  public static final byte STATUS_FETCH_MAX         = 0x3f;
  
  public static final byte STATUS_SIGNATURE         = 0x41;
  public static final byte STATUS_INJECTED          = 0x42;
  public static final byte STATUS_LINKED            = 0x43;
  
  public static boolean hasDbStatus(CrawlDatum datum) {
    if (datum.status <= STATUS_DB_MAX) return true;
    return false;
  }

  public static boolean hasFetchStatus(CrawlDatum datum) {
    if (datum.status > STATUS_DB_MAX &&
        datum.status <= STATUS_FETCH_MAX) return true;
    return false;
  }

... so, I went with ranges of values. The most unwieldy
switch() statements in the current code were related to the
checking between DB or Fetch status, and the above two
static methods handle this and simplify the code.
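
To illustrate how range checks keep callers stable when codes are added, here is a small self-contained sketch. It works on a bare status byte rather than on CrawlDatum; the class name RangeStatusDemo, the classify() helper, and the extra code STATUS_FETCH_EXAMPLE are hypothetical additions for the example.

```java
/** Hypothetical sketch: dispatching on status ranges instead of a switch. */
final class RangeStatusDemo {
    static final byte STATUS_DB_MAX    = 0x1f;
    static final byte STATUS_FETCH_MAX = 0x3f;

    // Assumed new code added inside the fetch range; not part of the patch above.
    static final byte STATUS_FETCH_EXAMPLE = 0x26;

    static boolean hasDbStatus(byte status) {
        return status <= STATUS_DB_MAX;
    }

    static boolean hasFetchStatus(byte status) {
        return status > STATUS_DB_MAX && status <= STATUS_FETCH_MAX;
    }

    /** Names the group a status falls into, without listing individual codes. */
    static String classify(byte status) {
        if (hasDbStatus(status)) return "db";
        if (hasFetchStatus(status)) return "fetch";
        return "other";
    }
}
```

A code added anywhere inside a range is classified correctly without touching classify(), which is the property the gaps between groups are meant to provide.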

Regarding the redirect URL - because of space constraints
I'd rather use Metadata for this. We already handle metadata
efficiently, so that performance doesn't suffer if we don't
have any metadata to keep. It would make sense, though, to
have a predefined key for this URL.
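
As a rough sketch of that idea, the redirect target could live under a predefined key in a lazily created map, so that records without metadata pay nothing for it. The class DatumMetaSketch, the key string, and the use of a plain HashMap are all assumptions for illustration; they do not reflect the actual Nutch metadata classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: redirect target kept in optional metadata. */
final class DatumMetaSketch {
    /** Assumed name for the predefined key; the real key is not specified here. */
    static final String REDIRECT_TARGET_KEY = "redirect.target";

    // Stays null until something is stored, so empty records carry no map.
    private Map<String, String> metaData;

    void setRedirectTarget(String url) {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        metaData.put(REDIRECT_TARGET_KEY, url);
    }

    /** Returns the stored redirect target, or null if none was recorded. */
    String getRedirectTarget() {
        return metaData == null ? null : metaData.get(REDIRECT_TARGET_KEY);
    }

    boolean hasMetaData() {
        return metaData != null && !metaData.isEmpty();
    }
}
```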


        
Closed: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring
user name
2006-12-28 00:14:22
     [ http://issues.apache.org/jira/browse/NUTCH-416?page=all ]

Andrzej Bialecki closed NUTCH-416.
----------------------------------

    Resolution: Fixed

Fixed in trunk, rev. 490607. As a side effect it is now
possible to correctly update CrawlDB from multiple segments,
even if they contain duplicate pages - the code in
CrawlDbReducer will correctly apply only the latest version
of CrawlDatum.


        