Thread: Created: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation




Created: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-14 03:07:25
Use Combiner in LinkDb to increase speed of linkdb generation
-------------------------------------------------------------

                 Key: NUTCH-498
                 URL: https://issues.apache.org/jira/browse/NUTCH-498
             Project: Nutch
          Issue Type: Improvement
          Components: linkdb
    Affects Versions: 0.9.0
            Reporter: Espen Amble Kolstad
            Priority: Minor


I tried to add the following combiner to LinkDb:


   public static class LinkDbCombiner extends MapReduceBase implements Reducer {
      private int _maxInlinks;

      @Override
      public void configure(JobConf job) {
         super.configure(job);
         _maxInlinks = job.getInt("db.max.inlinks", 10000);
      }

      public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException {
         // Accumulate into the first Inlinks value for this key.
         final Inlinks inlinks = (Inlinks) values.next();
         int combined = 0;
         while (values.hasNext()) {
            Inlinks val = (Inlinks) values.next();
            for (Iterator it = val.iterator(); it.hasNext();) {
               // Stop early once the configured cap is reached.
               if (inlinks.size() >= _maxInlinks) {
                  output.collect(key, inlinks);
                  return;
               }
               Inlink in = (Inlink) it.next();
               inlinks.add(in);
            }
            combined++;
         }
         if (inlinks.size() == 0) {
            return;
         }
         if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
         }
         output.collect(key, inlinks);
      }
   }


This greatly reduced the time it took to generate a new linkdb. In my case it
reduced the time by half.


|Map output records|8717810541|
|Combined|7632541507|
|Resulting output records|1085269034|

That's an 87% reduction of output records from the map phase.
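
The counter figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: `CombinerStats` and its methods are hypothetical names used for illustration and are not part of Nutch or Hadoop.

```java
// Illustrative only: CombinerStats is a hypothetical helper, not part of Nutch.
final class CombinerStats {
    /** Records left after the combiner removed `combined` of `mapOutput` records. */
    static long remaining(long mapOutput, long combined) {
        return mapOutput - combined;
    }

    /** Fraction of map output records eliminated by the combiner. */
    static double reduction(long mapOutput, long combined) {
        return (double) combined / mapOutput;
    }

    public static void main(String[] args) {
        long mapOutput = 8_717_810_541L; // "Map output records" from the report
        long combined = 7_632_541_507L;  // "Combined" from the report
        System.out.println(remaining(mapOutput, combined)); // prints 1085269034
        System.out.printf("%.2f%n", reduction(mapOutput, combined));
    }
}
```

The remaining-record count matches the third row of the table, and the ratio of combined to map output records lands between 0.87 and 0.88, consistent with the quoted figure.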

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Updated: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-14 03:28:26
     [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Espen Amble Kolstad updated NUTCH-498:
--------------------------------------

    Description: 
I tried to add the following combiner to LinkDb:

   public static enum Counters { COMBINED }

   public static class LinkDbCombiner extends MapReduceBase implements Reducer {
      private int _maxInlinks;

      @Override
      public void configure(JobConf job) {
         super.configure(job);
         _maxInlinks = job.getInt("db.max.inlinks", 10000);
      }

      public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException {
         // Accumulate into the first Inlinks value for this key.
         final Inlinks inlinks = (Inlinks) values.next();
         int combined = 0;
         while (values.hasNext()) {
            Inlinks val = (Inlinks) values.next();
            for (Iterator it = val.iterator(); it.hasNext();) {
               // Stop early once the configured cap is reached,
               // reporting what has been combined so far.
               if (inlinks.size() >= _maxInlinks) {
                  if (combined > 0) {
                     reporter.incrCounter(Counters.COMBINED, combined);
                  }
                  output.collect(key, inlinks);
                  return;
               }
               Inlink in = (Inlink) it.next();
               inlinks.add(in);
            }
            combined++;
         }
         if (inlinks.size() == 0) {
            return;
         }
         if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
         }
         output.collect(key, inlinks);
      }
   }

This greatly reduced the time it took to generate a new linkdb. In my case it
reduced the time by half.


Map output records         8717810541
Combined                   7632541507
Resulting output records   1085269034

That's an 87% reduction of output records from the map phase.



Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-15 02:58:26
    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505079 ]

Enis Soztutar commented on NUTCH-498:
-------------------------------------

I think you may not want

   reporter.incrCounter(Counters.COMBINED, combined);

which increments the counter by the total count so far, but
rather you may use

   reporter.incrCounter(Counters.COMBINED, 1);

for each url combined.

Could you attach the patch against current trunk, so that we can apply it
directly?
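
For comparison, the two counting styles can be sketched side by side. This is a simplified simulation, not Nutch code: `CounterSketch`, `batched` and `perValue` are hypothetical names, and an `AtomicLong` stands in for the Hadoop counter. It assumes the batched call is made once per `reduce()` invocation, as in the code posted in the issue; under that assumption both styles should arrive at the same total.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: AtomicLong stands in for a Hadoop counter.
final class CounterSketch {
    /** Count merged values locally and report them once per reduce() call. */
    static void batched(int valuesForKey, AtomicLong counter) {
        int combined = 0;
        for (int i = 1; i < valuesForKey; i++) {
            combined++; // every value after the first is merged into the first
        }
        if (combined > 0) {
            counter.addAndGet(combined);
        }
    }

    /** Increment the counter by 1 for each merged value. */
    static void perValue(int valuesForKey, AtomicLong counter) {
        for (int i = 1; i < valuesForKey; i++) {
            counter.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        AtomicLong a = new AtomicLong();
        AtomicLong b = new AtomicLong();
        for (int valuesForKey : new int[] {1, 3, 5}) {
            batched(valuesForKey, a);
            perValue(valuesForKey, b);
        }
        System.out.println(a.get() + " " + b.get()); // prints "6 6"
    }
}
```

The totals would only diverge if the batched increment were issued inside the loop with a running total, which is the situation the comment above warns about.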




Updated: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-15 04:21:26
     [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Espen Amble Kolstad updated NUTCH-498:
--------------------------------------

    Attachment: LinkDbCombiner.patch

Here's a patch for trunk.

I removed the Counter since it isn't really useful information; it only
served to show the reduction of output records.




Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-15 07:26:26
    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505197 ]

Doğacan Güney commented on NUTCH-498:
-------------------------------------

Why can't we just set combiner class as LinkDb? AFAICS, you
are not doing anything different than LinkDb.reduce in
LinkDbCombiner.reduce. A single-liner

job.setCombinerClass(LinkDb.class);

should do the trick, shouldn't it?
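
Why the reducer can double as the combiner can be illustrated without any Hadoop classes. The sketch below is a simplified simulation under stated assumptions: `CombinerSketch`, `reduce` and `combineThenReduce` are hypothetical names, and plain sets of strings stand in for `Inlinks`. It applies the same capped merge once per simulated map task and then once more over the partial results.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the LinkDb reduce logic; sets of strings replace Inlinks.
final class CombinerSketch {
    /** Merge partial inlink sets into a fresh set, stopping once maxInlinks is reached. */
    static Set<String> reduce(Iterator<Set<String>> values, int maxInlinks) {
        Set<String> result = new LinkedHashSet<>();
        while (values.hasNext()) {
            for (String inlink : values.next()) {
                if (result.size() >= maxInlinks) {
                    return result;
                }
                result.add(inlink);
            }
        }
        return result;
    }

    /** Run reduce once per map task (the combiner step), then once over the partials. */
    static Set<String> combineThenReduce(List<List<Set<String>>> mapOutputs, int maxInlinks) {
        List<Set<String>> partials = new ArrayList<>();
        for (List<Set<String>> oneMapTask : mapOutputs) {
            partials.add(reduce(oneMapTask.iterator(), maxInlinks));
        }
        return reduce(partials.iterator(), maxInlinks);
    }

    public static void main(String[] args) {
        // Two simulated map tasks emitting inlinks for the same target URL.
        List<List<Set<String>>> mapOutputs = List.of(
            List.of(Set.of("http://a/"), Set.of("http://b/")),
            List.of(Set.of("http://c/"), Set.of("http://a/")));
        System.out.println(combineThenReduce(mapOutputs, 10000));
    }
}
```

Because the merge is a capped set union, running it early on each map task's output should leave the final set with the same size as a single reduce over all records, although which inlinks survive the cap may differ.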



Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-15 09:06:26
    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505242 ]

Espen Amble Kolstad commented on NUTCH-498:
-------------------------------------------

Yes, you're right.

I forgot I added a new class just to get the Counter ...



Updated: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-15 09:08:26
     [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Espen Amble Kolstad updated NUTCH-498:
--------------------------------------

    Attachment: LinkDbCombiner.patch

Made a patch for the one-liner mentioned above.



Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-15 09:24:28
    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505249 ]

Doğacan Güney commented on NUTCH-498:
-------------------------------------

After examining the code better, I am a bit confused. We
have a LinkDb.Merger.reduce and LinkDb.reduce. They both do
the same thing (aggregate inlinks until its size is
maxInlinks then collect). Why do we have them separately? Is
there a difference between them that I am missing?



Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-15 11:53:26
    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505302 ]

Andrzej Bialecki  commented on NUTCH-498:
-----------------------------------------

Currently there is no difference, indeed. The version in
LinkDb.reduce is safer, because it uses a separate instance
of Inlinks. Perhaps we could replace LinkDb.Merger.reduce
with the body of LinkDb.reduce, and completely remove
LinkDb.reduce.
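
One way a separate accumulator can be safer is sketched below. It assumes the framework may hand out the same value object on every `next()` call (an assumption about the runtime, not something stated in this thread); `ReuseSketch`, `ReusingIterator`, `mergeIntoFirst` and `mergeIntoFresh` are hypothetical names, and sets of strings stand in for `Inlinks`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation; ReusingIterator mimics a framework that recycles one value object.
final class ReuseSketch {
    /** Hands out the same set instance each time, refilled with the next record. */
    static final class ReusingIterator implements Iterator<Set<String>> {
        private final List<Set<String>> records;
        private final Set<String> shared = new LinkedHashSet<>();
        private int pos = 0;

        ReusingIterator(List<Set<String>> records) { this.records = records; }

        @Override public boolean hasNext() { return pos < records.size(); }

        @Override public Set<String> next() {
            shared.clear();
            shared.addAll(records.get(pos++));
            return shared;
        }
    }

    /** Accumulates into the first value returned by the iterator. */
    static Set<String> mergeIntoFirst(Iterator<Set<String>> values) {
        Set<String> acc = values.next();
        while (values.hasNext()) {
            // Copy before adding so a set is never modified while being iterated.
            acc.addAll(new ArrayList<>(values.next()));
        }
        return acc;
    }

    /** Accumulates into a separate, freshly created instance. */
    static Set<String> mergeIntoFresh(Iterator<Set<String>> values) {
        Set<String> acc = new LinkedHashSet<>();
        while (values.hasNext()) {
            acc.addAll(values.next());
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Set<String>> records = List.of(Set.of("a"), Set.of("b"), Set.of("c"));
        System.out.println(mergeIntoFirst(new ReusingIterator(records))); // prints [c]
        System.out.println(mergeIntoFresh(new ReusingIterator(records))); // prints [a, b, c]
    }
}
```

With a recycling iterator, the accumulator taken from the first `next()` call is overwritten by each later record, so only the last record survives; the fresh instance keeps all of them.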



Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-16 06:03:26
    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505454 ]

Doğacan Güney commented on NUTCH-498:
-------------------------------------

> Currently there is no difference, indeed. The version
> in LinkDb.reduce is safer, because it uses a separate
> instance of Inlinks. Perhaps we could
> replace LinkDb.Merger.reduce with the body of
> LinkDb.reduce, and completely remove LinkDb.reduce.

Sounds good. I opened NUTCH-499 for this.



Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-27 06:00:42
    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508505 ]

Doğacan Güney commented on NUTCH-498:
-------------------------------------

I tested creating a linkdb from ~6M URLs:

Combine input records   42,091,902
Combine output records  15,684,838

(Combiner reduces number of records to around 1/3.)

Job took ~15 min overall with combiner, ~20 minutes without
combiner.

So, +1 from me.
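
For reference, the "around 1/3" estimate and the runtime saving can be worked out from the figures in this comment. This is a sketch only; the class and method names are invented for the example and the inputs are the numbers quoted above.

```java
// Hypothetical helper, not part of Nutch: works out the ratios from the test run above.
final class CombinerRatio {
    /** Fraction of combiner input records that survive as combiner output. */
    static double outputRatio(long combineInput, long combineOutput) {
        return (double) combineOutput / combineInput;
    }

    /** Relative runtime saving in percent, given minutes without and with the combiner. */
    static double percentSaved(double minutesWithout, double minutesWith) {
        return 100.0 * (minutesWithout - minutesWith) / minutesWithout;
    }

    public static void main(String[] args) {
        // About 0.37, a little more than one third.
        System.out.printf("output/input = %.2f%n", outputRatio(42_091_902L, 15_684_838L));
        // ~20 minutes down to ~15 minutes is a saving of about a quarter.
        System.out.printf("runtime saved = %.0f%%%n", percentSaved(20, 15));
    }
}
```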






Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-27 06:08:26
    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508506 ]

Andrzej Bialecki  commented on NUTCH-498:
-----------------------------------------

+1.

> Use Combiner in LinkDb to increase speed of linkdb generation
> -------------------------------------------------------------
>
>                 Key: NUTCH-498
>                 URL: https://issues.apache.org/jira/browse/NUTCH-498
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 0.9.0
>            Reporter: Espen Amble Kolstad
>            Priority: Minor
>         Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch
>
>
> I tried to add the following combiner to LinkDb
>    public static enum Counters { COMBINED }
>    public static class LinkDbCombiner extends MapReduceBase implements Reducer {
>       private int _maxInlinks;
>       @Override
>       public void configure(JobConf job) {
>          super.configure(job);
>          _maxInlinks = job.getInt("db.max.inlinks", 10000);
>       }
>       public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException {
>             final Inlinks inlinks = (Inlinks) values.next();
>             int combined = 0;
>             while (values.hasNext()) {
>                Inlinks val = (Inlinks) values.next();
>                for (Iterator it = val.iterator(); it.hasNext();) {
>                   if (inlinks.size() >= _maxInlinks) {
>                      if (combined > 0) {
>                         reporter.incrCounter(Counters.COMBINED, combined);
>                      }
>                      output.collect(key, inlinks);
>                      return;
>                   }
>                   Inlink in = (Inlink) it.next();
>                   inlinks.add(in);
>                }
>                combined++;
>             }
>             if (inlinks.size() == 0) {
>                return;
>             }
>             if (combined > 0) {
>                reporter.incrCounter(Counters.COMBINED, combined);
>             }
>             output.collect(key, inlinks);
>       }
>    }
> This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.
> Map output records    8717810541
> Combined              7632541507
> Resulting output rec  1085269034
> That's an 87% reduction of output records from the map phase
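
The merging step of the quoted combiner can be illustrated without any Hadoop or Nutch classes. The sketch below is an assumption-laden stand-in, not the attached patch: inlinks are plain strings, `InlinkCombineSketch` and `combine` are invented names, and a fresh list is used as the accumulator where the quoted code reuses the first value. It only shows the idea of folding several partial inlink lists for one key into a single list capped at a maximum size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the combiner's merge logic; not the Nutch implementation.
final class InlinkCombineSketch {
    /** Merges the partial inlink lists for one key, keeping at most maxInlinks entries. */
    static List<String> combine(List<List<String>> partials, int maxInlinks) {
        List<String> merged = new ArrayList<>();
        for (List<String> partial : partials) {
            for (String inlink : partial) {
                if (merged.size() >= maxInlinks) {
                    return merged; // cap reached: emit what has been collected so far
                }
                merged.add(inlink);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two map-side partial lists for the same target URL (example data).
        List<List<String>> partials = Arrays.asList(
                Arrays.asList("http://a.example/", "http://b.example/"),
                Arrays.asList("http://c.example/"));
        System.out.println(combine(partials, 10000)); // all three inlinks in one record
        System.out.println(combine(partials, 2));     // truncated at the cap
    }
}
```

Several records for one key leave the map side as a single record, which is where the drop in map output records reported above comes from. In Hadoop's old `mapred` API a combiner class is registered on the job with `JobConf.setCombinerClass`.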



Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-27 06:19:26
    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508508 ]

Sami Siren commented on NUTCH-498:
----------------------------------

+1



Resolved: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-27 07:47:26
     [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-498.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Doğacan Güney

Committed in rev. 551147.



Closed: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-27 07:47:26
     [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-498.
-------------------------------


Issue resolved and committed.



Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
2007-06-28 02:04:25
    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508748 ]

Hudson commented on NUTCH-498:
------------------------------

Integrated in Nutch-Nightly #131 (See
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])


