Thread: Created: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
|
|
| Created: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-14 03:07:25 |
Use Combiner in LinkDb to increase speed of linkdb generation
-------------------------------------------------------------
Key: NUTCH-498
URL: https://issues.apache.org/jira/browse/NUTCH-498
Project: Nutch
Issue Type: Improvement
Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Priority: Minor
I tried to add the following combiner to LinkDb:
public static class LinkDbCombiner extends MapReduceBase
    implements Reducer {
  private int _maxInlinks;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    _maxInlinks = job.getInt("db.max.inlinks", 10000);
  }

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    // Accumulate into the first value of the group.
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        // Stop as soon as the configured cap is reached.
        if (inlinks.size() >= _maxInlinks) {
          output.collect(key, inlinks);
          return;
        }
        Inlink in = (Inlink) it.next();
        inlinks.add(in);
      }
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
This greatly reduced the time it took to generate a new
linkdb. In my case it reduced the time by half.
|Map output records|8717810541|
|Combined|7632541507|
|Resulting output rec|1085269034|
That's an 87% reduction of output records from the map phase.
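As a quick check of the figures above, the sketch below recomputes the remaining record count and the reduction percentage from the two reported counters. The class and method names are illustrative only and are not part of Nutch.

```java
public class CombinerReduction {
    /** Percentage of map output records that the combiner merged away. */
    static double reductionPercent(long mapOutputRecords, long combinedRecords) {
        return 100.0 * combinedRecords / mapOutputRecords;
    }

    public static void main(String[] args) {
        long mapOutput = 8_717_810_541L;  // "Map output records" from the report
        long combined = 7_632_541_507L;   // "Combined" from the report

        // Records still leaving the map phase after combining.
        long remaining = mapOutput - combined;
        System.out.println("remaining records: " + remaining);

        // Truncated to a whole percent, as in the report.
        int percent = (int) reductionPercent(mapOutput, combined);
        System.out.println("reduction: " + percent + "%");
    }
}
```

The difference of the two counters matches the third reported row, which suggests the three numbers are mutually consistent.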
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Updated: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-14 03:28:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------
Description:
I tried to add the following combiner to LinkDb:
public static enum Counters { COMBINED }
public static class LinkDbCombiner extends MapReduceBase
    implements Reducer {
  private int _maxInlinks;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    _maxInlinks = job.getInt("db.max.inlinks", 10000);
  }

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    // Accumulate into the first value of the group.
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        // Cap reached: report what was merged so far and emit.
        if (inlinks.size() >= _maxInlinks) {
          if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
          }
          output.collect(key, inlinks);
          return;
        }
        Inlink in = (Inlink) it.next();
        inlinks.add(in);
      }
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
This greatly reduced the time it took to generate a new
linkdb. In my case it reduced the time by half.
Map output records    8717810541
Combined              7632541507
Resulting output rec  1085269034
That's an 87% reduction of output records from the map phase.
|
|
| Commented: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-15 02:58:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505079 ]
Enis Soztutar commented on NUTCH-498:
-------------------------------------
I think you may not want
reporter.incrCounter(Counters.COMBINED, combined);
which increments the counter by the total count so far, but
rather you may use
reporter.incrCounter(Counters.COMBINED, 1);
for each URL combined.
Could you attach the patch against current trunk, so
that we can apply it directly?
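A minimal sketch of the two counting styles being compared. It uses a hypothetical Counter class as a stand-in, not Hadoop's Reporter, and the method names are illustrative. Under the assumption that the batch increment is issued only once per reduce call, both styles end with the same total; the per-item style would matter if a running total were reported more than once.

```java
import java.util.Arrays;
import java.util.List;

public class CounterStyles {
    /** Hypothetical stand-in for a job counter; not the Hadoop Reporter API. */
    static final class Counter {
        long value;
        void incr(long amount) { value += amount; }
    }

    /** Batch style: count merged values locally, report once per call. */
    static void reduceBatch(List<String> values, Counter counter) {
        int combined = 0;
        for (int i = 1; i < values.size(); i++) {
            combined++;              // one more value merged into the first
        }
        if (combined > 0) {
            counter.incr(combined);  // single report at the end of the call
        }
    }

    /** Per-item style: report 1 for every merged value. */
    static void reducePerItem(List<String> values, Counter counter) {
        for (int i = 1; i < values.size(); i++) {
            counter.incr(1);
        }
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("d"),
            Arrays.asList("e", "f"));
        Counter batch = new Counter();
        Counter perItem = new Counter();
        for (List<String> group : groups) {
            reduceBatch(group, batch);
            reducePerItem(group, perItem);
        }
        System.out.println(batch.value + " " + perItem.value);  // prints "3 3"
    }
}
```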
|
|
| Updated: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-15 04:21:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------
Attachment: LinkDbCombiner.patch
Here's a patch for trunk.
I removed the Counter since it isn't really useful
information; it only served to show the reduction of output records.
|
|
| Commented: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-15 07:26:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505197 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
Why can't we just set combiner class as LinkDb? AFAICS, you
are not doing anything different than LinkDb.reduce in
LinkDbCombiner.reduce. A single-liner
job.setCombinerClass(LinkDb.class);
should do the trick, shouldn't it?
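The reason a reducer like this can double as a combiner can be sketched in plain Java, without Hadoop. The class below is a simplified model: inlinks are strings, an Inlinks value is a set, and reduce is a hypothetical stand-in for the capped merge described in this issue, not the actual LinkDb code. It applies the same function once per simulated map task and once more at the end, then compares that with a single pass over all records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CombinerReuse {
    /** Merge groups of inlinks into a fresh set, stopping at maxInlinks. */
    static Set<String> reduce(List<Set<String>> values, int maxInlinks) {
        Set<String> result = new LinkedHashSet<>();
        for (Set<String> value : values) {
            for (String inlink : value) {
                if (result.size() >= maxInlinks) {
                    return result;
                }
                result.add(inlink);
            }
        }
        return result;
    }

    /** Small helper to build an insertion-ordered set. */
    static Set<String> set(String... items) {
        return new LinkedHashSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        int max = 3;
        List<Set<String>> task1 = Arrays.asList(set("a"), set("b"), set("c"));
        List<Set<String>> task2 = Arrays.asList(set("c"), set("d"));

        // No combiner: the reducer sees every map output record.
        List<Set<String>> all = new ArrayList<>(task1);
        all.addAll(task2);
        Set<String> direct = reduce(all, max);

        // Combiner: the same function runs per map task, then once in the reducer.
        Set<String> combined = reduce(
            Arrays.asList(reduce(task1, max), reduce(task2, max)), max);

        System.out.println(direct + " " + combined);  // prints "[a, b, c] [a, b, c]"
    }
}
```

When the cap truncates a group, which inlinks survive can depend on the order in which records arrive, so in general only the final size is guaranteed to match between the two paths.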
|
|
| Commented: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-15 09:06:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505242 ]
Espen Amble Kolstad commented on NUTCH-498:
-------------------------------------------
Yes, you're right
I forgot I added a new class just to get the Counter ...
|
|
| Updated: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-15 09:08:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------
Attachment: LinkDbCombiner.patch
Made a patch for the one-liner mentioned above
|
|
| Commented: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-15 09:24:28 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505249 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
After examining the code better, I am a bit confused. We
have a LinkDb.Merger.reduce and LinkDb.reduce. They both do
the same thing (aggregate inlinks until its size is
maxInlinks then collect). Why do we have them separately? Is
there a difference between them that I am missing?
|
|
| Commented: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-15 11:53:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505302 ]
Andrzej Bialecki commented on NUTCH-498:
-----------------------------------------
Currently there is no difference, indeed. The version in
LinkDb.reduce is safer, because it uses a separate instance
of Inlinks. Perhaps we could replace LinkDb.Merger.reduce
with the body of LinkDb.reduce, and completely remove
LinkDb.reduce.
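Why a separate accumulator instance is safer can be illustrated with a small plain-Java model. It assumes, as a labeled assumption about the framework rather than a documented guarantee for this Hadoop version, that the values iterator may hand out the same mutable object on every call. ReusingIterator and both merge methods are hypothetical names written for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ValueReuse {
    /** Hypothetical iterator that returns the same mutable list on every call. */
    static final class ReusingIterator implements Iterator<List<String>> {
        private final Iterator<List<String>> source;
        private final List<String> shared = new ArrayList<>();

        ReusingIterator(List<List<String>> data) {
            this.source = data.iterator();
        }

        @Override
        public boolean hasNext() {
            return source.hasNext();
        }

        @Override
        public List<String> next() {
            shared.clear();                 // overwrites whatever the caller kept
            shared.addAll(source.next());
            return shared;
        }
    }

    /** Accumulates into the first value; unsafe if values are reused. */
    static List<String> mergeIntoFirst(Iterator<List<String>> values) {
        List<String> result = values.next();
        while (values.hasNext()) {
            result.addAll(values.next());
        }
        return result;
    }

    /** Accumulates into a separate instance; unaffected by reuse. */
    static List<String> mergeIntoFresh(Iterator<List<String>> values) {
        List<String> result = new ArrayList<>();
        while (values.hasNext()) {
            result.addAll(values.next());
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> data = Arrays.asList(
            Arrays.asList("a"), Arrays.asList("b"), Arrays.asList("c"));
        System.out.println(mergeIntoFirst(new ReusingIterator(data)));  // prints "[c, c]"
        System.out.println(mergeIntoFresh(new ReusingIterator(data)));  // prints "[a, b, c]"
    }
}
```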
|
|
| Commented: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-16 06:03:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505454 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
> Currently there is no difference, indeed. The version
> in LinkDb.reduce is safer, because it uses a separate
> instance of Inlinks. Perhaps we could
> replace LinkDb.Merger.reduce with the body of
> LinkDb.reduce, and completely remove LinkDb.reduce.
Sounds good. I opened NUTCH-499 for this.
|
|
| Commented: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-27 06:00:42 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508505 ]
Doğacan Güney commented on NUTCH-498:
-------------------------------------
I tested creating a linkdb from ~6M urls:
Combine input records   42,091,902
Combine output records  15,684,838
(Combiner reduces number of records to around 1/3.)
Job took ~15 min overall with combiner, ~20 minutes without
combiner.
So, +1 from me.
|
|
| Commented: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-27 06:08:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508506 ]
Andrzej Bialecki commented on NUTCH-498:
-----------------------------------------
+1.
|
|
| Commented: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-27 06:19:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508508 ]
Sami Siren commented on NUTCH-498:
----------------------------------
+1
|
|
| Resolved: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-27 07:47:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney resolved NUTCH-498.
---------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0
Assignee: Doğacan Güney
Committed in rev. 551147.
|
|
| Closed: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-27 07:47:26 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney closed NUTCH-498.
-------------------------------
Issue resolved and committed.
|
|
| Commented: (NUTCH-498) Use Combiner in
LinkDb to increase speed of linkdb
generation |
  United States |
2007-06-28 02:04:25 |
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508748 ]
Hudson commented on NUTCH-498:
------------------------------
Integrated in Nutch-Nightly #131 (see [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])
|
|