Thread: Cross Platform Administration and Deployment for Nutch and Hadoop

Cross Platform Administration and Deployment for Nutch and Hadoop
2007-01-23 14:06:18

All,

We are starting to design and develop a framework in Python that will
better automate different pieces of Nutch and Hadoop administration and
deployment, and we wanted to get community feedback.

We first want to replace the DFS and MapReduce startup scripts with
Python alternatives. All of the features below are assumed to be
executed from a central location (for example the namenode). These
scripts would allow expanded functionality that would include:

1) Start, stop, restart of individual DFS and MapReduce nodes. (This
   would be able to start and stop the namenode and jobtracker as well
   but would first check to see if data/task nodes were running and take
   appropriate action.)
2) Start, stop, restart dfs or map reduce cluster independently.
3) Start, stop, restart the entire dfs and map reduce cluster.
4) Allow for individual data/job nodes to have different deployment
   locations. This is necessary for a heterogeneous OS cluster.
5) Allow cross platform and heterogeneous OS clusters.
6) Get detailed status of individual nodes or all nodes. This would
   include items such as disk space, cpu usage, etc.
7) Reboot or shutdown machines. Again this would take into account
   running services.
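
As an illustration of item 1, here is a minimal sketch of how a central controller might plan a namenode or jobtracker stop. It is not the proposed framework: the `NODES` inventory, the host names, and the install paths are hypothetical, the use of `bin/hadoop-daemon.sh` over ssh is an assumption about how commands would be issued, and "appropriate action" is interpreted here as stopping the dependent worker daemons first.

```python
import shlex

# Hypothetical inventory: each host may have a different install location.
NODES = {
    "master": {"home": "/opt/hadoop"},
    "slave1": {"home": "/opt/hadoop"},
    "slave2": {"home": "/usr/local/hadoop"},
}

# Master daemons mapped to the worker daemons that depend on them.
DEPENDENTS = {"namenode": "datanode", "jobtracker": "tasktracker"}


def daemon_command(host, action, daemon, nodes=NODES):
    """Build the ssh argv that runs one daemon action on one host."""
    home = nodes[host]["home"]
    # Assumption: the remote side provides bin/hadoop-daemon.sh.
    remote = "%s/bin/hadoop-daemon.sh %s %s" % (shlex.quote(home), action, daemon)
    return ["ssh", host, remote]


def stop_plan(host, daemon, running, nodes=NODES):
    """Return the commands needed to stop `daemon` on `host`.

    `running` is a set of (host, daemon) pairs believed to be up.
    Workers that depend on the daemon are stopped before it.
    """
    plan = []
    worker = DEPENDENTS.get(daemon)
    if worker:
        for other in sorted(nodes):
            if (other, worker) in running:
                plan.append(daemon_command(other, "stop", worker, nodes))
    plan.append(daemon_command(host, "stop", daemon, nodes))
    return plan
```

Building the plan as a list of argv lists keeps the decision logic separate from execution, so the same plan could be printed for a dry run or handed to whatever mechanism actually runs the commands.
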

Next we would like to split the tools for Nutch (such as crawl,
invertlinks, etc.) and the tools for Hadoop (dfs, job, etc.) into their
own individual Python scripts that would allow the following
functionality:

1) Dynamic configuration of variables or resetting of the config
   directory. This might need to be enhanced with changes to the
   configuration classes in Hadoop (don't know yet).
2) Dynamically set other variables such as Java heap space and log file
   directories.

We already have a script that automates a continual fetching process in
user-defined blocks of a given number of URLs. This script handles the
entire process of injecting, generating, fetching, updating the db, and
merging segments and crawl databases, then loops to do the next fetch,
and so on until a stop command is given.
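
The existing script is not shown in this message, so the following is only a sketch of what such a loop could look like. It assumes the standard `bin/nutch` subcommands (`inject`, `generate`, `fetch`, `updatedb`), models the stop command as a hypothetical stop file, and leaves out the merge steps. The `run` argument is any callable that executes an argv list (for example `subprocess.check_call`).

```python
import os


def newest_segment(segments_dir):
    """Return the most recently named segment directory.

    Assumes segment names sort chronologically (timestamp-style names).
    """
    names = sorted(os.listdir(segments_dir))
    return os.path.join(segments_dir, names[-1])


def crawl_loop(run, crawldb, segments_dir, url_dir, top_n, stop_file):
    """Inject once, then generate/fetch/update until stop_file appears.

    Returns the number of completed fetch rounds.
    """
    run(["bin/nutch", "inject", crawldb, url_dir])
    rounds = 0
    while not os.path.exists(stop_file):
        # Each round handles one user-defined block of top_n URLs.
        run(["bin/nutch", "generate", crawldb, segments_dir, "-topN", str(top_n)])
        segment = newest_segment(segments_dir)
        run(["bin/nutch", "fetch", segment])
        run(["bin/nutch", "updatedb", crawldb, segment])
        rounds += 1
    return rounds
```

Passing `run` in as a parameter makes it possible to exercise the loop with a fake runner before pointing it at a real cluster.
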

Next, we want to create Python scripts that will automate deployment to
various nodes and perform maintenance tasks. Unless otherwise stated,
the scripts would be able to deploy to different deployment locations
configured per machine and allow deployment to an individual machine, a
list of machines, or all machines in the cluster. They would also allow
either the backing up or removal of old items. This would include:

1) Deploy new release, all code and files.
2) Deploy all lib files.
3) Deploy all conf.
4) Deploy a single file.
5) Deploy all bin files.
6) Deploy a single plugin.
7) Deploy the Nutch job and jar files.
8) Deploy all plugins.
9) Remove all log files or archive to a given location.
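
The targeting rules above (one machine, a list, or the whole cluster, each with its own deployment location) can be sketched as follows. This is an illustration only: the `DEPLOY_ROOTS` inventory and host names are hypothetical, and using `rsync` as the transport is an assumption, since the message only says the push would go over ssh.

```python
# Hypothetical inventory: each host has its own deployment root.
DEPLOY_ROOTS = {
    "node1": "/opt/nutch",
    "node2": "/usr/local/nutch",
    "node3": "/opt/nutch",
}


def select_hosts(target, roots=DEPLOY_ROOTS):
    """Accept "all", a single host name, or a list of host names."""
    if target == "all":
        return sorted(roots)
    hosts = [target] if isinstance(target, str) else list(target)
    unknown = [h for h in hosts if h not in roots]
    if unknown:
        raise ValueError("unknown hosts: %s" % ", ".join(unknown))
    return hosts


def deploy_commands(local_dir, subdir, target, backup_dir=None, roots=DEPLOY_ROOTS):
    """Build one rsync argv per selected host for a subdirectory (lib, conf, ...)."""
    commands = []
    for host in select_hosts(target, roots):
        argv = ["rsync", "-az", "--delete"]
        if backup_dir:
            # Keep replaced or deleted files instead of discarding them.
            argv += ["--backup", "--backup-dir=%s" % backup_dir]
        argv.append("%s/%s/" % (local_dir, subdir))
        argv.append("%s:%s/%s/" % (host, roots[host], subdir))
        commands.append(argv)
    return commands
```

For example, `deploy_commands("build/nutch", "conf", "all")` would yield one command per host in the inventory, each pointing at that host's own root.
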

Finally, we would like to automate search index deployment and
administration. Unless otherwise stated, the scripts would be able to
deploy to different target locations configured per machine and allow
targeting an individual machine, a list of machines, or all machines
in the cluster. This functionality would include:

1) Configure a cluster of search servers.
2) Deploy, remove, and redeploy index pieces (parts of an index) to
   search servers.
3) Start, stop, and restart search servers.

We would have detailed help screens and fully documented scripts using
a common framework of scripts. If it were designed correctly, we could
set up job streams that did automatic crawls, re-crawls, integrations,
indexing, and deployments to search servers, all of which would be
needed for the ongoing operation of a web search engine.

There is a catch: this functionality would require Python to be
installed on at least the controller node. This would be a push to the
machines, and it would be implemented in Python using pexpect and
probably by issuing commands through ssh, etc.

If you have thoughts on this, please let us know and we will see if we
can integrate the requests into the development.

Dennis Kubes

Re: Cross Platform Administration and Deployment for Nutch and Hadoop
2007-01-24 03:59:49

Dennis Kubes wrote:
> All,
>
> We are starting to design and develop a framework in Python that will
> better automate different pieces of Nutch and Hadoop administration
> and deployment and we wanted to get community feedback.
>
> We first want to replace the DFS and MapReduce startup scripts with
> python alternatives. All of the features below are assumed to be
> executed from a central location (for example the namenode). These
> scripts would allow expanded functionality that would include:
>
> 1) Start, stop, restart of individual DFS and MapReduce nodes. (This
> would be able to start and stop the namenode and jobtracker as well
> but would first check to see if data/task nodes were running and take
> appropriate action.)
> 2) Start, stop, restart dfs or map reduce cluster independently.
> 3) Start, stop, restart the entire dfs and map reduce cluster.
> 4) Allow for individual data/job nodes to have different deployment
> locations. This is necessary for a heterogeneous OS cluster.
> 5) Allow cross platform and heterogeneous OS clusters.
> 6) Get detailed status of individual nodes or all nodes. This would
> include items such as disk space, cpu usage, etc.
> 7) Reboot or shutdown machines. Again this would take into account
> running services.
>
Well, I think a heterogeneous OS cluster might not be a good idea.
Managing both Linux and Windows in one script might be tricky.
> Next we would like to split the tools for nutch (such as crawl,
> invertlinks, etc.) and the tools for hadoop (dfs, job, etc.) into
> their own individual python scripts that would allow the following
> functionality.
>
> 1) Dynamic configuration of variables or resetting of config
> directory. This might need to be enhanced with changes to the
> configuration classes in Hadoop (don't know yet).
> 2) Dynamically set other variables such as java heap space and log
> file directories.
You don't have to change the Configuration classes. Each runnable class
in Nutch (except crawl) extends ToolBase, which accepts a -conf
argument.
Java heap space is configurable from bin/nutch, so a little modification
will work. NUTCH_LOG_DIR is read from the environment in bin/nutch.
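
A wrapper script could rely on these hooks rather than on changes to the Java classes. The sketch below is one way to do that, under assumptions: `NUTCH_LOG_DIR` and the `-conf` argument come from the points above, while the `NUTCH_HEAPSIZE` variable name and placing `-conf` directly after the tool name are assumptions that should be checked against the bin/nutch script and ToolBase in the installed version.

```python
import os


def nutch_invocation(tool, args, conf_file=None, heap_mb=None, log_dir=None, base_env=None):
    """Return (argv, env) for running one bin/nutch tool with overrides."""
    if tool == "crawl" and conf_file is not None:
        # Per the note above, crawl does not extend ToolBase.
        raise ValueError("crawl does not accept -conf")
    env = dict(os.environ if base_env is None else base_env)
    if heap_mb is not None:
        env["NUTCH_HEAPSIZE"] = str(heap_mb)  # assumed variable name
    if log_dir is not None:
        env["NUTCH_LOG_DIR"] = log_dir
    argv = ["bin/nutch", tool]
    if conf_file is not None:
        argv += ["-conf", conf_file]  # assumed to precede the tool's own arguments
    argv += list(args)
    return argv, env
```

The returned pair can be passed to `subprocess.call(argv, env=env)`, which keeps the per-run settings out of the shared configuration directory.
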
>
> We already have a script that automates a continual fetching process
> in user defined blocks of number of urls. This script handles the
> entire process of injecting, generating, fetching, updating db,
> merging segments and crawl databases and looping and doing the next
> fetch and so on until a stop command is given.
>
> Next, we want to create Python scripts that will automate deployment to
> various nodes and perform maintenance tasks. Unless otherwise stated
> the scripts would be able to deploy to different deployment locations
> configured per machine and allow deployment to an individual machine,
> a list of machines, or all machines in the cluster. It would also
> allow either the backing up or removal of old items. This would include:
>
> 1) Deploy new release, all code and files.
> 2) Deploy all lib files.
> 3) Deploy all conf.
> 4) Deploy a single file.
> 5) Deploy all bin files.
> 6) Deploy a single plugin.
> 7) Deploy the Nutch job and jar files.
> 8) Deploy all plugins
> 9) Remove all log files or archive to a given location.
Well, unfortunately, there are lots of issues in updating the current
codebase. Lots of manual testing has to be done. Sometimes a new
feature introduces bugs, and previous files become incompatible. From
my experience, automating such tasks is not straightforward.
>
> Finally we would like to automate search index deployment and
> administration. Unless otherwise stated the scripts would be able to
> deploy to different target locations configured per machine and allow
> targeting to an individual machine, a list of machines, or all
> machines in the cluster. This functionality would include:
>
> 1) Configure a cluster of search servers.
> 2) Deploy, remove, and redeploy index pieces (parts of an index) to
> search servers.
> 3) Start, Stop, and restart search servers.
Well, this can be handy. I have written a script which uses
start-stop-daemon to start and stop the index servers as a background
process. The script also checks the status of the server. But what is
really needed in Nutch is dynamically adding and removing a bunch of
index servers without affecting the front-end.
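
The script mentioned here is not included in the message. As a rough illustration of the status-check part only, a controller could probe each index server's TCP port. This is a sketch under assumptions: the host and port list is hypothetical, and an open port is taken as a proxy for "server is up", which is weaker than a real health check.

```python
import socket


def server_is_up(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def cluster_status(servers, timeout=2.0):
    """Map "host:port" to up/down for a list of (host, port) pairs."""
    return {"%s:%d" % (h, p): server_is_up(h, p, timeout) for h, p in servers}
```

A front-end could consult such a status map before routing queries, which is one possible step toward adding and removing servers without disturbing it.
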
>
> We would have detailed help screens and fully documented scripts
> using a common framework of scripts. If it was designed correctly we
> could setup job streams that did automatic crawls, re-crawls,
> integrations, indexing, and deployments to search servers. All of
> which would be needed for the ongoing operation of a web search engine.
>
> There is a catch and that is that this functionality would require
> python to be installed on at least the controller node.
> This would be a push to the machines and it would be implemented in
> python using pexpect and probably implementing commands through ssh, etc.
>
I think the Python dependency won't be a problem if this management
script is good enough.
A web interface along with the Python interface would be great, I think.
> If you have thoughts on this please let us know and we will see if we
> can integrate the requests into the development.
>
> Dennis Kubes
>
Good luck with that. I will be looking forward to seeing it.