|
Thread: Re: Platform reliability with Hadoop
|
|
| Re: Platform reliability with Hadoop |

|
2008-01-20 13:43:49 |
You might want to change the hadoop.tmp.dir entry alone. Since the
others are derived from it, everything should be fine.
I am wondering, though, whether hadoop.tmp.dir might be used elsewhere.
thanks,
lohit
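A minimal sketch of what changing hadoop.tmp.dir alone could look like in the site configuration file. The /u1/cloud-data base is borrowed from Jeff's message below; the hadoop-${user.name} suffix simply mirrors the stock value and is an assumption, not something confirmed in this thread. With this single override, any property whose default is written in terms of ${hadoop.tmp.dir} should follow along, provided it is not also overridden explicitly.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical override: only the base directory moves off /tmp. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/u1/cloud-data/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
```

Note that ${user.name} expands per process, so daemons started as 'hadoop' and a client run as 'jeastman' would resolve this value to different local directories.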
----- Original Message ----
From: Jeff Eastman <jeastman collab.net>
To: hadoop-user lucene.apache.org
Sent: Sunday, January 20, 2008 11:05:28 AM
Subject: RE: Platform reliability with Hadoop
I am almost operational again but something in my configuration is still
not quite right. Here's what I did:
- I created a directory /u1/cloud-data on every machine's local disk
- I created a new user 'hadoop' who owns cloud-data
- I used that directory to replace the hadoop.tmp.dir entries for:
  - mapred.system.dir
  - mapred.local.dir
  - dfs.name.dir
  - dfs.data.dir
- The other tmp.dir config entries are unchanged
- The hadoop_install directory is NFS mounted on all machines
- My name node is on cu027 and my job tracker is on cu063
- I launched the dfs and mapred processes as 'hadoop'
- I uploaded my data to the dfs as user 'jeastman'
- The files are visible in /users/jeastman when I ls as 'jeastman'
- When I submit a job as 'jeastman' that used to run, it runs but cannot
  locate any input data so it quits immediately with this in the Map
  Completion Graph display:
XML Parsing Error: no element found
Location: http://cu063.cubit.sp.collab.net:50030/taskgraph?type=map&jobid=job_200801182307_0003
Line Number 1, Column 1:
I've attached my site.xml file.
Jeff
-----Original Message-----
From: Jason Venner [mailto:jason attributor.com]
Sent: Wednesday, January 16, 2008 10:04 AM
To: hadoop-user lucene.apache.org
Subject: Re: Platform reliability with Hadoop
The /tmp default has caught us once or twice too. Now we put the files
elsewhere.
lohit.vijayarenu yahoo.com wrote:
>> The DFS is stored in /tmp on each box.
>> The developers who own the machines occasionally reboot and
>> reprofile them
>>
>
> Won't you lose your blocks after reboot since /tmp gets cleaned up?
> Could this be the reason you see data corruption?
> A good idea is to configure DFS to be any place other than /tmp.
>
> Thanks,
> Lohit
> ----- Original Message ----
> From: Jeff Eastman <jeastman collab.net>
> To: hadoop-user lucene.apache.org
> Sent: Wednesday, January 16, 2008 9:32:41 AM
> Subject: Platform reliability with Hadoop
>
> I've been running Hadoop 0.14.4 and, more recently, 0.15.2 on a dozen
> machines in our CUBiT array for the last month. During this time I have
> experienced two major data corruption losses on relatively small amounts
> of data (<50gb) that make me wonder about the suitability of this
> platform for hosting Hadoop. CUBiT is one of our products for managing a
> pool of development servers, allowing developers to check out machines,
> install various OS profiles on them and monitor their utilization via
> the web. With most machines reporting very low utilization it seemed a
> natural place to run Hadoop in the background. I have an NFS-mounted
> account on all of the machines and have installed Hadoop there. The DFS
> is stored in /tmp on each box. The developers who own the machines
> occasionally reboot and reprofile them, but this occurs infrequently and
> does not clobber /tmp. Hadoop is designed to deal with slave failures of
> this nature, though this platform may well be an acid test.
>
> My initial cloud was configured for replication factor of 3 and I have
> increased that now to 4 in hopes of improving data reliability in the
> face of these more-prevalent slave outages. Ted Dunning has suggested
> aggressive rebalancing in his recent posts and I have done this by
> increasing replication to 5 (from 3) and then dropping it to 4. Are
> there other rebalancing or configuration techniques that might improve
> my data reliability? Or, is this platform just too unstable to be a good
> fit for Hadoop?
>
> Jeff
>
-----Inline Attachment Follows-----
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <!-- global properties -->
  <property>
    <name>mapred.system.dir</name>
    <value>/u1/cloud-data/mapred/system</value>
    <description>The shared directory where MapReduce stores control
    files.
    </description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/u1/cloud-data/mapred/local</value>
    <description>The local directory where MapReduce stores intermediate
    data files. May be a comma-separated list of directories on
    different devices in order to spread disk i/o.
    Directories that do not exist are ignored.
    </description>
  </property>
  <property>
    <name>mapred.job.tracker.info.port</name>
    <value>50030</value>
    <description>The port that the MapReduce job tracker info webserver
    runs at.
    </description>
  </property>
  <property>
    <name>dfs.secondary.info.port</name>
    <value>50090</value>
    <description>The base number for the Secondary namenode info port.
    </description>
  </property>
  <property>
    <name>dfs.datanode.port</name>
    <value>50010</value>
    <description>The port number that the dfs datanode server uses as a
    starting point to look for a free port to listen on.
    </description>
  </property>
  <property>
    <name>dfs.info.port</name>
    <value>50070</value>
    <description>The base port number for the dfs namenode web ui.
    </description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <!-- file system properties -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://cu027.cubit.sp.collab.net:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.
    </description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/u1/cloud-data/dfs/name</value>
    <description>Determines where on the local filesystem the DFS name
    node should store the name table. If this is a comma-delimited list
    of directories then the name table is replicated in all of the
    directories, for redundancy. </description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/u1/cloud-data/dfs/data</value>
    <description>Determines where on the local filesystem an DFS data
    node should store its blocks. If this is a comma-delimited
    list of directories, then data will be stored in all named
    directories, typically on different devices.
    Directories that do not exist are ignored.
    </description>
  </property>
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>0</value>
    <description>Reserved space in bytes per volume. Always leave this
    much space free for non dfs use.
    </description>
  </property>
  <property>
    <name>dfs.datanode.du.pct</name>
    <value>0.50f</value>
    <description>When calculating remaining space, only use this
    percentage of the real available space
    </description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is
    created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
  <!-- map/reduce properties -->
  <property>
    <name>mapred.job.tracker</name>
    <value>cu063.cubit.sp.collab.net:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>31</value>
    <description>The default number of map tasks per job. Typically set
    to a prime several times greater than number of available hosts.
    Ignored when mapred.job.tracker is "local".
    </description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>
    <description>The default number of reduce tasks per job. Typically
    set to a prime close to the number of available hosts. Ignored when
    mapred.job.tracker is "local".
    </description>
  </property>
</configuration>
|
|
| RE: Platform reliability with Hadoop |

|
2008-01-21 13:14:59 |
Is it really that simple?
The Wiki page GettingStartedWithHadoop recommends setting dfs.name.dir,
dfs.data.dir, dfs.client.buffer.dir and mapred.local.dir to
"appropriate" values (without giving an example). Should these be fixed
(XX) or variable (XX-${user.name}) values? The FAQ page recommends
setting the mapred.system.dir to a fixed value (e.g.
/hadoop/mapred/system), so I chose fixed values too.
- dfs.name.dir - /u1/cloud-data
- dfs.data.dir - /u1/cloud-data
- mapred.system.dir - /u1/cloud-data
- mapred.local.dir - /u1/cloud-data
I did not override the dfs.client.buffer.dir (Determines where on the
local filesystem an DFS client should store its blocks before it sends
them to the datanode) because my 'jeastman' client could not put data
into the dfs with it set to the fixed value.
There are 4 other settings that use the ${hadoop.tmp.dir}, and these
seem appropriately tmp-ish. I did not redefine them:
- fs.trash.root - The trash directory, used by FsShell's 'rm' command.
- fs.checkpoint.dir - Determines where on the local filesystem the DFS
  secondary name node should store the temporary images and edits to
  merge.
- fs.s3.buffer.dir - Determines where on the local filesystem the S3
  filesystem should store its blocks before it sends them to S3 or after
  it retrieves them from S3.
- mapred.temp.dir - A shared directory for temporary files.
Jeff
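One way to read the "variable (XX-${user.name})" option asked about above, sketched for dfs.client.buffer.dir only. The path is hypothetical; it assumes the parent directory is writable by every client user, which this thread does not confirm, so treat it as an illustration of the syntax rather than a verified fix.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.client.buffer.dir</name>
    <!-- Hypothetical per-user path: ${user.name} expands to the user
         running the client, so 'jeastman' and 'hadoop' get separate
         buffer directories. -->
    <value>/u1/cloud-data/dfs/tmp-${user.name}</value>
  </property>
</configuration>
```

A fixed value, by contrast, makes every user share one directory, which only works if they can all write to it.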
|
|
| RE: Platform reliability with Hadoop |

|
2008-01-21 13:53:55 |
I should add, when I run my job it indicates it could not find the input
files:
[jeastman cu027 jeastman]$ $HADOOP_INSTALL/bin/hadoop jar ~/access0.jar com.collabnet.hadoop.access.Access0Driver ecn/access ecn-out
08/01/21 10:59:39 INFO mapred.FileInputFormat: Total input paths to process: 0
08/01/21 10:59:46 INFO mapred.JobClient: Running job: job_200801182307_0005
08/01/21 10:59:47 INFO mapred.JobClient: map 100% reduce 100%
08/01/21 10:59:48 INFO mapred.JobClient: Job complete: job_200801182307_0005
08/01/21 10:59:49 INFO mapred.JobClient: Counters: 0
I tried using full paths for them (/users/jeastman/...) but that throws
'input path does not exist' errors.
Jeff
|
|
|
|