cluster-tools
cluster-tools is a set of two scripts, cluster-run and cluster-kill, which I
use to run jobs on our institute's cluster of Linux machines. cluster-run is
a Python script which keeps polling a set of machines and starts new jobs
once CPUs become free. cluster-kill is a trivial bash script which remotely
runs killall some-program on a given set of machines.
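The polling idea can be sketched roughly as follows. This is only a guess at
the kind of check involved, not cluster-run's actual implementation; in
particular the load-average heuristic and the function name free_cpus are
assumptions made for illustration:

```shell
# Rough sketch (an assumption, not cluster-run's actual code): estimate
# the number of free CPUs on a machine from its 1-minute load average.
free_cpus() {
    ncpus=$1   # total number of CPUs on the machine
    load=$2    # current 1-minute load average
    awk -v n="$ncpus" -v l="$load" 'BEGIN {
        # Round the load up to the next integer before subtracting.
        free = n - int(l + 0.999)
        print (free > 0 ? free : 0)
    }'
}
```

A machine would then be considered ready for another job whenever such a
check reports enough free CPUs.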
This package is free software licensed under the GNU GPL, version 2 or later.
Installation
Use Git to clone the repository:
git clone http://falma.de/git/cluster-tools.git
(You can also browse the Git repository online.)
Should you encounter problems please let me know.
cluster-run
Usage
Consult cluster-run --help for an explanation of all the options. It will
print
Usage: cluster-run [options] MACHINE [MACHINE...]
Description:
A list of commands to be executed is read from stdin. The commands are
executed via ssh on the remote machines in the same directory as the working
directory on the local machine (we expect to run on a cluster with a shared
file system). Unless the option --no-redirect is used, standard output is
redirected to a file called DATE_MACHINE.dat and standard error is
redirected to a file called DATE_MACHINE.err. The first line of each .dat
file is generated by cluster-run and has the form
# Output of <command>
Empty .err-files are deleted.
Options:
-c N, --max-children=N  Launch at most N child processes. Defaults to 10.
-m N, --min-free=N      Launch jobs only on machines with at least N free CPUs.
-n, --no-redirect       Do not redirect standard output / error automatically.
-p N, --passes=N        Run the commands read from stdin N times. Default is 1.
-r T, --rest=T          Minimum time in seconds between starting jobs on the
                        same machine. Defaults to 5.
-h, --help              Print this message and exit.
-V, --version           Print information about version, author and license.
Examples:
cluster-run <commands machine1 machine2 machine3 machine4
echo a.out | cluster-run -p100 clust{0..9}
Running the same command many times
Suppose you have 10 machines named clust0 to clust9 that you can ssh into,
and they all share the same /home file system. You have a program called
a.out which does some Monte Carlo calculations. It initializes its random
number generator with real random numbers from /dev/urandom and outputs the
result to stdout. You would like to run a.out 1000 times and then average
the results (I often use anadat for that). This can be achieved by executing
echo a.out | cluster-run --passes=1000 clust{0..9}
(clust{0..9} is a useful bash shortcut for clust0 clust1 ... clust9)
cluster-run will redirect the standard output of each job into a file named
DATE_MACHINE.dat (e.g. 20090906-181705_clust4.dat) and the standard error to
DATE_MACHINE.err. It will make sure to never start two jobs in the same
second on the same machine. The first line of each .dat file will have the
form
# Output of <command> <parameter1> <parameter2> ...
If the .err-file is empty upon completion of the job, it is deleted.
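Once all jobs have finished, the results can be averaged with anadat or, in
the simple case of one number per file, with a plain awk one-liner. The
following is a sketch under the assumption that each .dat file contains a
single numerical result below the comment line; the helper name average_dat
is made up for illustration:

```shell
# Average the numbers in the given .dat files, skipping the
# "# Output of ..." comment line that cluster-run prepends.
average_dat() {
    awk '!/^#/ { sum += $1; n++ } END { if (n) print sum / n }' "$@"
}
# Usage: average_dat *.dat
```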
Running a command with changing parameters
The reason why the commands to be executed are passed to cluster-run via stdin
is to allow for many different commands to be passed at the same time. Each
input line is a command to be executed the number of times specified by the
option passes
(whose default value is 1).
Now imagine you want to run b.out 1000 times, giving it a numerical
parameter which ranges from 1 to 1000. This means we have to run 1000
different commands, which we will pass to cluster-run's stdin.
for i in `seq 1000`; do echo "b.out $i"; done | cluster-run clust{0..9}
This will work, but the names of the output files will still follow the
scheme DATE_MACHINE, giving no hint about the parameter values used.
However, you can recover this information from the first line of each output
file.
If you prefer to name the output files according to a custom scheme, use the
option --no-redirect and take care of the redirections yourself, e.g.
for i in `seq 1000`; do echo "b.out $i >$i.out 2>$i.err"; done |
    cluster-run --no-redirect clust{0..9}
Do not forget to redirect stderr.
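If a fixed-width naming scheme is preferred (so that 0001.out sorts before
0010.out), the command list can be generated with printf instead. This is a
sketch reusing the hypothetical b.out from above; the function name
gen_commands is made up for illustration:

```shell
# Generate commands with zero-padded parameters and file names.
# The output of this function is meant to be piped into
#   cluster-run --no-redirect clust{0..9}
gen_commands() {
    for i in $(seq "$1"); do
        printf 'b.out %04d >%04d.out 2>%04d.err\n' "$i" "$i" "$i"
    done
}
```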
Running multi-threaded programs
Nothing changes if you run programs which can utilize more than one CPU per
machine. If a program is able to use all the processing power of a machine,
cluster-run detects this and does not run more than one job on that machine.
In some cases a program might only be able to use all the CPUs after an
initialization phase. It would be unfavorable if cluster-run started
additional jobs during that phase. In such cases the option --rest=N can be
used to tell cluster-run to leave a machine alone for N seconds after
starting a job on it.
cluster-kill
The script is trivial but I find it useful. Just execute
cluster-kill a.out clust{0..9}
to run killall a.out via ssh on the machines clust0 to clust9.
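A script of this kind can be very short indeed. The following is a sketch of
what such a script might look like, not the actual cluster-kill source; the
function name cluster_kill is made up for illustration:

```shell
# Hypothetical sketch: run "killall PROGRAM" via ssh on every machine
# given after the program name.
cluster_kill() {
    prog=$1
    shift
    for machine in "$@"; do
        ssh "$machine" killall "$prog"
    done
}
# Usage: cluster_kill a.out clust{0..9}
```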
History
cluster-tools was written and is maintained by Christoph Groth.
- 2007-01-21: Version 0.1
- 2008-12-08: Version 0.2
- 2009-10-27: Version 0.3
- to-be-released: Version 0.4