Thread: Created: (NUTCH-395) Increase fetching speed




Created: (NUTCH-395) Increase fetching speed
user name
2006-10-29 20:41:17
Increase fetching speed
-----------------------

                 Key: NUTCH-395
                 URL: http://issues.apache.org/jira/browse/NUTCH-395
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.8.1
            Reporter: Sami Siren
         Assigned To: Sami Siren


There has been some discussion on the Nutch mailing lists
about the fetcher being slow; this patch tries to address
that. The patch is just a quick hack and needs some cleaning
up. It currently applies to the 0.8 branch, not trunk, and it
has not been tested at large scale. What does it change?

Metadata - the original Metadata does spellchecking; the new
version does not (a decorator is provided that can do it, and
it should perhaps be used where HTTP headers are handled, but
in most cases the functionality is not required).

Reading/writing various data structures - the patch tries to
do IO more efficiently; see the patch for details.

Initial benchmark:

A small benchmark was done to measure the performance of
changes with a script that basically does the following:
-inject a list of urls into a fresh crawldb
-create fetchlist (10k urls pointing to local filesystem)
-fetch
-updatedb

original code from 0.8-branch:
real    10m51.907s
user    10m9.914s
sys     0m21.285s

after applying the patch
real    4m15.313s
user    3m42.598s
sys     0m18.485s



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa

-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
Updated: (NUTCH-395) Increase fetching speed
user name
2006-10-29 20:43:19
     [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:
-----------------------------

    Attachment: nutch-0.8-performance.txt

a rough patch for testing purposes

        
Commented: (NUTCH-395) Increase fetching speed
user name
2006-10-30 09:48:17
    [ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445532 ]
            
Andrzej Bialecki  commented on NUTCH-395:
-----------------------------------------

I have several comments on this patch:

* have you measured what made the biggest impact on
performance - changes to Metadata, or changes to IO in
FetcherOutput?

* I think it's a good idea to separate two concerns with
PlainMetadata / MetadataSpellChecker. Since the latter is a
subclass I think it would be more appropriate to name it
SpellCheckedMetadata.

* I'd also argue for keeping the name Metadata and just
replacing the body of the class with the PlainMetadata
implementation - this way we could avoid changing the API in
so many places; for compatibility we could just bump the
version number in Metadata. We could then also avoid changes
to the version IDs of other classes that rely on Metadata,
such as Content, ParseData et al.

* new Metadata / SpellCheckedMetadata need JUnit tests -
this is important, because many other classes rely on proper
working of these classes.

* Fetcher.VoidReducer is not needed - I'm guessing you
wanted to use it just for logging.

* please observe formatting rules, especially whitespace
rules - this patch doesn't follow them.
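
The split discussed in the comments above (a plain Metadata class, plus a
subclass that adds spellchecking of names) could look roughly like the
following minimal sketch. It is not the code from the patch: the class
bodies, the list of canonical names, and the simple lower-casing
normalization that stands in for real spellchecking are all assumptions
made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Plain metadata: stores names exactly as given, with no normalization cost. */
class Metadata {
    private final Map<String, String> values = new HashMap<>();

    public void set(String name, String value) {
        values.put(name, value);
    }

    public String get(String name) {
        return values.get(name);
    }
}

/** Subclass that pays for name normalization only where it is wanted. */
class SpellCheckedMetadata extends Metadata {
    // Hypothetical canonical names, keyed by a lower-cased, letters-only form.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        for (String name : new String[] {"Content-Type", "Content-Length", "Last-Modified"}) {
            CANONICAL.put(simplify(name), name);
        }
    }

    private static String simplify(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }

    /** Maps e.g. "content_type" to "Content-Type"; unknown names pass through unchanged. */
    static String normalize(String name) {
        return CANONICAL.getOrDefault(simplify(name), name);
    }

    @Override
    public void set(String name, String value) {
        super.set(normalize(name), value);
    }

    @Override
    public String get(String name) {
        return super.get(normalize(name));
    }
}
```

With this shape, callers that only hold a Metadata reference keep compiling
unchanged, and the normalizing variant can be chosen just at the places
that handle HTTP headers.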

        
Commented: (NUTCH-395) Increase fetching speed
user name
2006-10-31 16:54:57
    [ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445956 ]
            
Sami Siren commented on NUTCH-395:
----------------------------------

> have you measured what made the biggest impact on
> performance - changes to Metadata, or changes to IO in
> FetcherOutput?

I did not have time yet; I would guess that the IO changes
make up the most significant part.

> I'd also argue for keeping the name Metadata and just
> replacing the body of the class with the PlainMetadata
> implementation - this way we could avoid changing the API in
> so many places; for compatibility we could just bump the
> version number in Metadata. We could then also avoid changes
> to the version IDs of other classes that rely on Metadata,
> such as Content, ParseData et al.

The API of the new metadata is exactly the same, but the
functionality changed, so I decided to make a totally new
class. But yes, I agree here: it's much cleaner to replace
the guts of the Metadata class.

> new Metadata / SpellCheckedMetadata need JUnit tests -
> this is important, because many other classes rely on proper
> working of these classes.

Sure, there were supposed to be some already in the patch,
but I just forgot to svn add them.

Now that I remember, there was one more odd thing in the
current implementation: the max number of links was not
enforced when writing outlinks, only when reading them. I am
planning to change this too, so that the number of links is
enforced on write.

> Fetcher.VoidReducer is not needed - I'm guessing you
> wanted to use it just for logging.

True.

> please observe formatting rules, especially whitespace
> rules - this patch doesn't follow them.

Will do. As I said, this was not meant to be a demonstration
of nice formatting or Java coding; I just wanted to throw out
the findings for people to try them out. I'll start to work
on a new version against trunk and will do it with a more
focused mindset.

        
Commented: (NUTCH-395) Increase fetching speed
user name
2006-10-31 18:47:17
    [ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445994 ]
            
Andrzej Bialecki  commented on NUTCH-395:
-----------------------------------------

> Now that I remember, there was one more odd thing in the
> current implementation: the max number of links was not
> enforced when writing outlinks, only when reading them. I am
> planning to change this too, so that the number of links is
> enforced on write.

AFAIK this was done on purpose, to facilitate processing of
existing data created with different settings. I.e. if
someone created a segment with a high max # of outlinks, you
should still be able to read it and process all outlinks. If
you enforce the max # during reading you won't be able to
process this data.

        
Commented: (NUTCH-395) Increase fetching speed
user name
2006-10-31 19:05:17
    [ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445999 ]
            
Sami Siren commented on NUTCH-395:
----------------------------------

> settings. I.e. if someone created a segment with a high
> max # of outlinks, you should still be able to read it and
> process all outlinks. If you enforce the max # during
> reading you won't be able to process this data.

Yes, I agree, but IMO we should also not store more than the
configured max # of links. Right now it seems we store them
all (or am I just not seeing it?).
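
The two positions in this exchange (cap the number of outlinks when
storing, but never drop stored links when reading) can be sketched as
below. This is only an illustration under stated assumptions: the class
and method names are hypothetical, outlinks are reduced to plain strings,
and none of it is taken from the Nutch sources.

```java
import java.util.ArrayList;
import java.util.List;

class OutlinkLimit {
    /** Write side: keep at most maxOutlinks links (hypothetical setting name). */
    static List<String> forWrite(List<String> outlinks, int maxOutlinks) {
        int n = Math.min(outlinks.size(), Math.max(0, maxOutlinks));
        return new ArrayList<>(outlinks.subList(0, n));
    }

    /** Read side: return everything that was stored, whatever limit is configured now. */
    static List<String> forRead(List<String> stored) {
        return new ArrayList<>(stored);
    }
}
```

Capping only on write keeps older segments, created with a higher limit,
fully readable, while new segments never grow past the configured limit.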

        
Commented: (NUTCH-395) Increase fetching speed
user name
2006-11-10 16:44:39
    [ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12448795 ]
            
Sami Siren commented on NUTCH-395:
----------------------------------

>> have you measured what made the biggest impact on
>> performance - changes to Metadata, or changes to IO in
>> FetcherOutput?
> I did not have time yet; I would guess that the IO changes
> make up the most significant part.

After more digging, my initial guess might not have been
correct. Without touching IO at all, I am able to get the
same improvement on trunk, compared to the nightly builds, as
I reported before on the 0.8 branch.

This is good, because we don't need to change file formats
at all.



        
Updated: (NUTCH-395) Increase fetching speed
user name
2006-11-11 08:57:38
     [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:
-----------------------------

    Attachment: NUTCH-395-trunk-metadata-only.patch

Here's a first stab at an svn trunk version of Nutch that just
optimizes the use of metadata and splits it into two
functionally distinct pieces: one for plain metadata and one
for spellchecking over the keys of metadata.

There's probably still room for optimization on both the
metadata and IO side.

The same local filesystem fetching benchmark was run as
earlier, this time on the trunk version. Even though the
benchmark was run with file:// URLs, the change should
affect other protocols too, specifically because it seems to
cut down the time needed for the reduce phase quite
aggressively.

I would also recommend adding some kind of baseline benchmark
for crawling operations to Nutch so we don't kill the
performance (again and again) at some point.

from svn trunk
----------------------
real    10m43.527s
user    10m11.210s
sys     0m21.837s

fetch breakdown:
5 min 19 sec	effective fetching
7 sec		sort
4 min 30 sec 	reduce > reduce


patched version
----------------------
real    4m53.742s
user    4m21.340s
sys     0m19.045s

fetch breakdown:
3 min 36 sec	effective fetching
8 sec		sort
27 sec 		reduce > reduce
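
To put the two runs above side by side, the figures can be converted to
seconds and divided. The helper below is a small sketch written for this
purpose; the class and method names are made up, and it assumes values in
the "10m43.527s" form shown above. From these numbers the wall-clock time
drops by a factor of roughly 2.2, and the reduce phase (4 min 30 sec
versus 27 sec) by a factor of 10.

```java
class TimeOutput {
    /** Parses a value such as "10m43.527s" into seconds. */
    static double toSeconds(String t) {
        int m = t.indexOf('m');
        double minutes = Double.parseDouble(t.substring(0, m));
        // Drop the trailing 's' before parsing the seconds part.
        double seconds = Double.parseDouble(t.substring(m + 1, t.length() - 1));
        return minutes * 60 + seconds;
    }

    public static void main(String[] args) {
        double trunk = toSeconds("10m43.527s");   // "real" time, svn trunk
        double patched = toSeconds("4m53.742s");  // "real" time, patched version
        System.out.printf("wall clock: %.1fs -> %.1fs, speedup %.2fx%n",
                trunk, patched, trunk / patched);
        // Reduce phase from the fetch breakdown: 4 min 30 sec vs 27 sec.
        System.out.printf("reduce: %ds -> %ds, speedup %.1fx%n", 270, 27, 270.0 / 27);
    }
}
```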



        
Updated: (NUTCH-395) Increase fetching speed
user name
2006-11-11 08:57:39
     [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:
-----------------------------

    Affects Version/s: 0.9.0

        
Updated: (NUTCH-395) Increase fetching speed
user name
2006-11-12 20:32:38
     [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren updated NUTCH-395:
-----------------------------

    Attachment: NUTCH-395-trunk-metadata-only-2.patch

Additional change to Content cuts down time needed in
effective fetching. Now seeing speeds like 45 pages/sec also
on http.

real    4m24.126s
user    3m53.835s
sys     0m18.681s

3 min 10 sec effective fetching
6 sec	sorting
27 sec  reduce > reduce

        
Commented: (NUTCH-395) Increase fetching speed
user name
2006-11-13 09:59:38
    [ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12449292 ]
            
Andrzej Bialecki  commented on NUTCH-395:
-----------------------------------------

+1 - this patch looks good to me - if you could just fix the
whitespace issues prior to committing, so that it conforms
to the coding style ...

        
Resolved: (NUTCH-395) Increase fetching speed
user name
2006-11-13 19:50:38
     [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren resolved NUTCH-395.
------------------------------

    Fix Version/s: 0.9.0
       Resolution: Fixed

applied to trunk with some additional whitespace changes.

> Increase fetching speed
> -----------------------
>
>                 Key: NUTCH-395
>                 URL: http://issues.apache.org/jira/browse/NUTCH-395
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: nutch-0.8-performance.txt,
>                      NUTCH-395-trunk-metadata-only-2.patch,
>                      NUTCH-395-trunk-metadata-only.patch
>
>
> There has been some discussion on the Nutch mailing lists about the
> fetcher being slow; this patch tries to address that. The patch is
> just a quick hack and needs some cleaning up. It also currently
> applies to the 0.8 branch and not trunk, and it has not been tested
> at large scale. What does it change?
>
> Metadata - the original metadata uses spellchecking, the new version
> does not (a decorator is provided that can do it, and it should
> perhaps be used where HTTP headers are handled, but in most cases
> the functionality is not required).
>
> Reading/writing various data structures - the patch tries to do I/O
> more efficiently; see the patch for details.
>
> Initial benchmark:
>
> A small benchmark was done to measure the performance of the changes
> with a script that basically does the following:
> - inject a list of URLs into a fresh crawldb
> - create a fetchlist (10k URLs pointing to the local filesystem)
> - fetch
> - updatedb
>
> original code from 0.8-branch:
> real    10m51.907s
> user    10m9.914s
> sys     0m21.285s
>
> after applying the patch:
> real    4m15.313s
> user    3m42.598s
> sys     0m18.485s
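
The two sets of `time` figures in the benchmark can be turned into a speedup factor with a little arithmetic. The following is only a minimal Java sketch of that calculation; the class and method names (TimeOutputSpeedup, toSeconds, speedup) are made up for illustration and are not part of Nutch or of the patch.

```java
// Illustrative only: TimeOutputSpeedup, toSeconds and speedup are
// hypothetical names, not part of Nutch or of the NUTCH-395 patch.
class TimeOutputSpeedup {

    // Converts a shell `time` value such as "10m51.907s" into seconds.
    static double toSeconds(String value) {
        int m = value.indexOf('m');
        double minutes = Double.parseDouble(value.substring(0, m));
        // Drop the trailing 's' before parsing the seconds part.
        double seconds = Double.parseDouble(value.substring(m + 1, value.length() - 1));
        return minutes * 60.0 + seconds;
    }

    // Ratio of the old duration to the new one; values above 1 mean faster.
    static double speedup(String before, String after) {
        return toSeconds(before) / toSeconds(after);
    }

    public static void main(String[] args) {
        // Wall-clock ("real") and CPU ("user") times reported in the issue.
        System.out.printf("real: %.2fx%n", speedup("10m51.907s", "4m15.313s"));
        System.out.printf("user: %.2fx%n", speedup("10m9.914s", "3m42.598s"));
    }
}
```

With the reported numbers this works out to a wall-clock speedup of roughly two and a half times for the small 10k-URL benchmark; it says nothing about how the patch behaves on larger crawls.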

        
Resolved: (NUTCH-395) Increase fetching speed
user name
2006-11-14 06:54:34
Sami,
Thanks for resolving this serious issue. I just updated my code from trunk
and plan to test the fetch speed. But there is a runtime error related to
switching from UTF8 to Text. Since the error comes from Hadoop, how do I fix
it?

java.lang.ClassCastException: org.apache.hadoop.io.UTF8
    at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:108)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:105)

Thanks,
AJ


On 11/13/06, Sami Siren (JIRA) <jiraapache.org> wrote:
>
>      [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]
>
> Sami Siren resolved NUTCH-395.
> ------------------------------
>
>     Fix Version/s: 0.9.0
>        Resolution: Fixed
>
> applied to trunk with some additional whitespace changes.


-- 
AJ Chen, PhD
http://web2express.org
Resolved: (NUTCH-395) Increase fetching speed
user name
2006-11-14 14:55:17
What version are you "upgrading" from? I guess pre rev. 464654?

If so, see [1] for additional info.

--
  Sami Siren

[1] http://wiki.apache.org/nutch/Upgrading_from_0%2e8%2ex_to_0%2e9

AJ Chen wrote:
> Sami,
> Thanks for resolving this serious issue.  I just updated my code from trunk
> and plan to test fetch speed. But there is a runtime error related to
> switching from UTF8 to Text. Since the error is from hadoop, how do I fix
> it?
> 
> java.lang.ClassCastException: org.apache.hadoop.io.UTF8
>    at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:108)
>    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
>    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:105)
> 
> Thanks,
> AJ
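
A ClassCastException of this kind is what one would expect when records written with an old key class are read by code that now casts keys to a new class. The following is a minimal Java sketch of that situation, assuming that this is indeed the cause here. OldKey, NewKey and KeyTypeMismatch are stand-ins invented for illustration; they are not the Hadoop UTF8 and Text classes, and the convert method is hypothetical. How existing Nutch data is actually converted is covered by the upgrade notes in [1], not by this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for illustration only. These are NOT the Hadoop classes:
// OldKey plays the role of a legacy key type, NewKey of its replacement.
class OldKey {
    final String value;
    OldKey(String value) { this.value = value; }
}

class NewKey {
    final String value;
    NewKey(String value) { this.value = value; }
}

class KeyTypeMismatch {

    // Mimics a map step that was recompiled to expect the new key type.
    static String map(Object key) {
        return ((NewKey) key).value; // throws ClassCastException for an OldKey
    }

    // Hypothetical one-off conversion: rewrite every stored key as a NewKey.
    static List<Object> convert(List<Object> stored) {
        List<Object> out = new ArrayList<>();
        for (Object key : stored) {
            out.add(key instanceof OldKey ? new NewKey(((OldKey) key).value) : key);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> stored = new ArrayList<>();
        stored.add(new OldKey("http://example.com/")); // written by the old code

        try {
            map(stored.get(0));
        } catch (ClassCastException e) {
            System.out.println("old data, new code: " + e.getClass().getSimpleName());
        }

        // After conversion the same map step succeeds.
        System.out.println(map(convert(stored).get(0)));
    }
}
```

The point of the sketch is only that the failure sits in the stored data rather than in the new code, which is why the question about the revision being upgraded from matters.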
Commented: (NUTCH-395) Increase fetching speed
user name
2006-11-22 17:09:12
I checked out the code from trunk after Sami committed the change. I started
a new crawl db and ran several crawl cycles sequentially on one Linux
server. See below for the real numbers from my test. The performance is
still poor because the crawler still spends too much time in the reduce and
update operations.

#crawl cycle: topN=200000
2006-11-17 17:25:27,367 INFO  crawl.Generator - Generator: segment: crawl/segments/20061117172527
2006-11-17 17:47:45,837 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20061117172527
# 8 hours fetching ~200000 pages
2006-11-18 03:13:31,992 INFO  mapred.LocalJobRunner - 183644 pages, 5506 errors, 5.4 pages/s, 1043 kb/s,
# 4 hours doing "reduce"
2006-11-18 07:30:38,085 INFO  crawl.CrawlDb - CrawlDb update: starting
# 4 hours update db
2006-11-18 11:17:54,000 INFO  crawl.CrawlDb - CrawlDb update: done

#crawl cycle: topN=500,000 pages
2006-11-18 13:22:51,530 INFO  crawl.Generator - Generator: segment: crawl/segments/20061118132251
2006-11-18 14:50:07,006 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20061118132251
# fetching for 16 hours
2006-11-19 06:53:34,923 INFO  mapred.LocalJobRunner - 394343 pages, 19050 errors, 6.8 pages/s, 1439 kb/s,
# reduce for 11 hours
2006-11-19 17:49:15,778 INFO  crawl.CrawlDb - CrawlDb update: segment: crawl/segments/20061118132251
# update db for 10 hours
2006-11-20 03:55:22,882 INFO  crawl.CrawlDb - CrawlDb update: done

#crawl cycle: topN=600,000 pages
2006-11-20 08:14:51,463 INFO  crawl.Generator - Generator: segment: crawl/segments/20061120081451
2006-11-20 11:31:22,384 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20061120081451
# fetching for 18 hours
2006-11-21 06:00:08,504 INFO  mapred.LocalJobRunner - 410078 pages, 26316 errors, 6.2 pages/s, 1257 kb/s,
# reduce for 11 hours
2006-11-21 17:26:38,213 INFO  crawl.CrawlDb - CrawlDb update: starting
# update for 13 hours
2006-11-22 06:25:48,592 INFO  crawl.CrawlDb - CrawlDb update: done
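
The hand-written annotations in the log are rounded; the phase lengths can also be read directly off the timestamps. The following is a minimal Java sketch for the first cycle, assuming (as the annotations do) that the gap between the last fetcher status line and "CrawlDb update: starting" is the reduce phase. CrawlPhaseTimes and its methods are hypothetical helpers written for this illustration, not Nutch classes.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative only: CrawlPhaseTimes is a hypothetical helper, not a Nutch class.
class CrawlPhaseTimes {

    // Timestamp layout of the log lines, e.g. "2006-11-17 17:47:45,837".
    static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    // Whole seconds elapsed between two log timestamps.
    static long secondsBetween(String start, String end) {
        LocalDateTime a = LocalDateTime.parse(start, LOG_TIME);
        LocalDateTime b = LocalDateTime.parse(end, LOG_TIME);
        return Duration.between(a, b).getSeconds();
    }

    // Formats a number of seconds as hours and minutes, e.g. "9h25m".
    static String hoursMinutes(long seconds) {
        return String.format("%dh%02dm", seconds / 3600, (seconds % 3600) / 60);
    }

    public static void main(String[] args) {
        // First cycle (topN=200000), timestamps copied from the log.
        long fetch = secondsBetween("2006-11-17 17:47:45,837", "2006-11-18 03:13:31,992");
        long reduce = secondsBetween("2006-11-18 03:13:31,992", "2006-11-18 07:30:38,085");
        long update = secondsBetween("2006-11-18 07:30:38,085", "2006-11-18 11:17:54,000");

        System.out.println("fetch:  " + hoursMinutes(fetch));
        System.out.println("reduce: " + hoursMinutes(reduce));
        System.out.println("update: " + hoursMinutes(update));
        // Average fetch rate from the page count in the fetcher status line.
        System.out.println("pages/s: " + 183644.0 / fetch);
    }
}
```

For this cycle the reduce and update phases together take almost as long as the fetch itself, which is the imbalance the message is pointing at.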


-AJ


On 11/13/06, Andrzej Bialecki (JIRA) <jiraapache.org> wrote:
>
>     [ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12449292 ]
>
> Andrzej Bialecki  commented on NUTCH-395:
> -----------------------------------------
>
> +1 - this patch looks good to me - if you could just fix the whitespace
> issues prior to committing, so that it conforms to the coding style ...


-- 
AJ Chen, PhD
Palo Alto, CA
http://web2express.org
Commented: (NUTCH-395) Increase fetching speed
user name
2006-11-22 18:20:48
What kind of hardware are you running on? Your pages-per-second rate seems
very low to me.

How big was your crawldb when you started, and how big was it at the end?

What kind of filters and normalizers are you using?

--
  Sami Siren

AJ Chen wrote:
> I checked out the code from trunk after Sami committed the change. I started
> a new crawl db and ran several crawl cycles sequentially on one Linux
> server. See below for the real numbers from my test. The performance is
> still poor because the crawler still spends too much time in the reduce and
> update operations.
> 
> #crawl cycle: topN=200000
> 2006-11-17 17:25:27,367 INFO  crawl.Generator - Generator: segment: crawl/segments/20061117172527
> 2006-11-17 17:47:45,837 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20061117172527
> # 8 hours fetching ~200000 pages
> 2006-11-18 03:13:31,992 INFO  mapred.LocalJobRunner - 183644 pages, 5506 errors, 5.4 pages/s, 1043 kb/s,
> # 4 hours doing "reduce"
> 2006-11-18 07:30:38,085 INFO  crawl.CrawlDb - CrawlDb update: starting
> # 4 hours update db
> 2006-11-18 11:17:54,000 INFO  crawl.CrawlDb - CrawlDb update: done
> 
> #crawl cycle: topN=500,000 pages
> 2006-11-18 13:22:51,530 INFO  crawl.Generator - Generator: segment: crawl/segments/20061118132251
> 2006-11-18 14:50:07,006 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20061118132251
> # fetching for 16 hours
> 2006-11-19 06:53:34,923 INFO  mapred.LocalJobRunner - 394343 pages, 19050 errors, 6.8 pages/s, 1439 kb/s,
> # reduce for 11 hours
> 2006-11-19 17:49:15,778 INFO  crawl.CrawlDb - CrawlDb update: segment: crawl/segments/20061118132251
> # update db for 10 hours
> 2006-11-20 03:55:22,882 INFO  crawl.CrawlDb - CrawlDb update: done
> 
> #crawl cycle: topN=600,000 pages
> 2006-11-20 08:14:51,463 INFO  crawl.Generator - Generator: segment: crawl/segments/20061120081451
> 2006-11-20 11:31:22,384 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20061120081451
> # fetching for 18 hours
> 2006-11-21 06:00:08,504 INFO  mapred.LocalJobRunner - 410078 pages, 26316 errors, 6.2 pages/s, 1257 kb/s,
> # reduce for 11 hours
> 2006-11-21 17:26:38,213 INFO  crawl.CrawlDb - CrawlDb update: starting
> # update for 13 hours
> 2006-11-22 06:25:48,592 INFO  crawl.CrawlDb - CrawlDb update: done
> 
> 
> -AJ
