Thread: Hadoop fetch jobs
|
|
| Hadoop fetch jobs |

|
2007-10-16 05:28:01 |
Hello, I've successfully set up a cluster of 3 machines under Hadoop.
However, I have a problem. While fetching, Hadoop generates 6 jobs, but
the pages are not spread equally across them: I get 5 jobs with ~3,500
pages each and one with ~50,000. That's not good, because the 5 small
jobs finish very quickly and afterwards only one machine is working
while the others wait. Could this be a problem with my configuration?
I've set the number of map jobs to 30, the number of reduce jobs to 6
and fetcher threads to 150, but during the fetch I still get only 6 map
jobs. Any help would be appreciated, thanks.
--
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology and Management
+48(17)8661277
|
|
| Re: Hadoop fetch jobs |
  United States |
2007-10-16 08:41:26 |
This is because some of the websites you are fetching have an unusually
large number of pages. Since Nutch partitions by hostname, all of these
pages get assigned to a single fetcher. The way to avoid this is to set
a maximum number of pages per site through the generate.max.per.host
configuration variable. In production we have this set to 10.

The downside is that for very large sites whose content you may want to
fetch in full (e.g. Wikipedia), only the top 10 pages of that site are
fetched per fetch cycle.
Dennis
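
For illustration only, here is a rough Python sketch of the effect described above. It is not Nutch code: the fetch list is made up, crc32 merely stands in for whatever hash the real partitioner uses, and cap_per_host keeps the first N URLs per host in input order rather than the top-scoring ones. It only shows why partitioning by hostname can leave one partition far larger than the rest, and how a per-host cap evens things out.

```python
from collections import Counter
from urllib.parse import urlparse
import zlib


def partition_by_host(urls, num_partitions):
    """Assign every URL to a partition chosen from its hostname only."""
    sizes = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        # Assumption: crc32 is a stand-in for the real partitioner's hash.
        sizes[zlib.crc32(host.encode("utf-8")) % num_partitions] += 1
    return [sizes[i] for i in range(num_partitions)]


def cap_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs per hostname, in input order."""
    seen = Counter()
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
    return kept


# Hypothetical fetch list: one very large site plus 500 small ones.
urls = [f"http://big.example.com/page{i}" for i in range(50_000)]
urls += [
    f"http://site{s}.example.org/page{i}"
    for s in range(500)
    for i in range(35)
]

uncapped = partition_by_host(urls, 6)
capped = partition_by_host(cap_per_host(urls, 10), 6)
print("uncapped partition sizes:", uncapped)
print("capped partition sizes:  ", capped)
```

Because every URL of the large host hashes to the same partition, that partition holds at least 50,000 URLs no matter how the small hosts are spread. With the cap at 10, no single host can contribute more than 10 URLs to any partition.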
|
|
| Re: Hadoop fetch jobs |

|
2007-10-18 04:46:27 |
Well, that's not the case. I have found out that those jobs have the
proper number of pages; however, they end prematurely because the
fetcher fails with an out-of-memory exception. Now I'm trying to fetch
without parsing, we'll see what happens...
--
Karol Rybak
|
|
| Re: Hadoop fetch jobs |

|
2007-10-18 08:24:22 |
Actually, setting -noParsing helped, but only a bit: I got about 6,000
pages fetched per job (1,000 earlier). I'll try using fetch instead of
fetch2 and hope that helps. Another question: how do I control the
number of fetch jobs? They do not behave like typical map jobs.
--
Karol Rybak
|
|