cluster-tools

cluster-tools is a set of two scripts, cluster-run and cluster-kill which I use to run jobs on our institute's cluster of Linux machines. cluster-run is a Python script which keeps polling a set of machines and starts new jobs once CPUs become free. cluster-kill is a trivial bash script which remotely runs killall some-program on a given set of machines.

This package is free software licensed under the GNU GPL, version 2 or later.

Installation

Use Git to clone the repository:

git clone http://falma.de/git/cluster-tools.git

(You can also browse the Git repository online.)

Should you encounter problems please let me know.

cluster-run

Usage

Consult cluster-run –help for an explanation of all the options. It will print

Usage: cluster-run [options] MACHINE [MACHINE...]
Description:
  A list of commands to be executed is read from stdin.  The commands are
  executed via ssh on the remote machines in the same directory as the working
  directory on the local machine (we expect to run on a cluster with a shared
  file system).  Unless the option –no-redirect is used, standard output is
  redirected to a file called DATE_MACHINE.dat and standard error is
  redirected to a file called DATE_MACHINE.err.  The first line of each .dat
  file is generated by cluster-run and has the form
  # Output of <command>
  Empty .err-files are deleted.
Options:
  -c, –max-children=N  Launch maximally N child processes.  Defaults to 10.
  -m N, –min-free=N    Launch jobs only on machines with at least N free CPUs.
  -n, –no-redirect     Do not redirect standard output / error automatically.
  -p N, –passes=N      Run the commands read from stdin N times.  Default is 1.
  -r, –rest=T          Minimum time in seconds between starting jobs on the
                        same machine.  Defaults to 5.
  -h, –help            Print this message and exit.
  -V, –version         Print information about version, author and license.
Examples:
  cluster-run <commands machine1 machine2 machine3 machine4
  echo a.out | cluster-run -p100 clust{0..9}

Running the same command many times

Suppose you have 10 machines named clust0 to clust9 you can ssh into and they all share the same /home file system. You have a program called a.out which does some Monte-Carlo calculations. It initializes its random number generator with real random numbers from /dev/urandom and outputs the result to stdout. You would like to run a.out 1000 times and then average the results (I often use anadat for that). This can be achieved by executing

echo a.out | cluster-run –passes=1000 clust{0..9}

(clust{0..9} is a useful bash shortcut for clust0 clust1 ... clust9)

cluster-run will redirect the standard output of each job into a file named MACHINE-TIME.out (e.g. 20090906-181705_clust4.dat) and the standard error to MACHINE-TIME.err. It will make sure to never start two jobs in the same second on the same machine. The first line of each .dat-file will have the form

# Output of <command> <parameter1> <parameter2> ...

If the .err-file is empty upon completion of the job, it is deleted.

Running a command with changing parameters

The reason why the commands to be executed are passed to cluster-run via stdin is to allow for many different commands to be passed at the same time. Each input line is a command to be executed the number of times specified by the option –passes (whose default value is 1).

Now imagine you want to run b.out 1000 times giving it a numerical parameter which ranges from 1 to 1000. This means we have to run 1000 different commands and we will pass them to cluster-run's stdin.

for i in `seq 1000`; do echo "b.out "; done | cluster-run clust{0..9}

This will work but the names of the output files will still follow the scheme MACHINE-TIME giving no hint about the used parameter values. However, you can recover this information from the first line of each output file.

If you prefer to name the output files according to a custom scheme use the option –no-redirect and take care of the redirections yourself, e.g.

for i in `seq 1000`; do echo "b.out  >.out 2>.err"; done |
cluster-run –no-redirect clust{0..9}

Do not forget to redirect stderr.

Running multi-threaded programs

Nothing changes if you run programs which can utilize more than one CPU per machine. If a program is able to use all the processing power of a computer cluster-run detects this and does not run more than one job per machine.

In some cases a program might only be able to use all the CPUs after an initialization phase. It would be unfavorable if cluster-run would start additional jobs during that phase. In such cases the option –rest=N can be used to tell cluster-run to leave a machine alone for N seconds after starting a job on it.

cluster-kill

The script is trivial but I find it useful. Just execute

cluster-kill a.out clust{0..9}

to execute killall a.out via ssh on the machines clust0 to clust9.

History

cluster-tools was written and is maintained by Christoph Groth.