Cluster Tips


It's easy to use a cluster the wrong way, so here are some tips on how to do it right.
See also the user-provided documentation:

Please note these external links were accurate on Dec 2nd 2019. Many thanks to Christine and James for providing these.

Head Nodes

Each cluster has one or more head nodes. A head node acts as a gateway between the cluster and everywhere else - users log in to it so that they can access the distributed filesystem and submit jobs using the job scheduler (usually Slurm).

Please do not run jobs on the head node! That's not what it's for! Instead, submit jobs from the head node using the cluster's job scheduling software. The jobs will then be allocated a cluster node to run on.
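
For example, once you have a job script (called myjob.sh here purely as an illustration), a typical session on the head node looks like this:

    # Submit the job script; the scheduler will allocate a cluster node to run it.
    sbatch myjob.sh

    # Check the state of your queued and running jobs.
    squeue -u $USER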

Files - places

If your jobs use files in the wrong location, they could run far more slowly than they need to, or stop everyone from using the cluster.
Here's where files can be stored:

  • The distributed filesystem. Each user has a directory on the distributed filesystem, like /home/username (for example /home/s2345678). It's accessible from every node in the cluster, but it's not fast to use. The distributed filesystem is coordinated by a filesystem node. The files themselves are distributed all over the cluster (hence the name), but the filesystem node gives them the appearance of a single unified filespace. It's good for getting files to and from the computing nodes. It's a bad place for your computing jobs to use as working space for reading or writing data, because it's too slow, and because that can overload the filesystem node. Instead, it's best to copy your code and input data from the distributed filesystem to scratch space at the start of a job, then copy output data back from scratch space to the distributed filesystem at the end of the job.
  • Scratch space is local disk space. Each node has its own local scratch space, and each node can only access its own scratch space. Scratch space is faster than the distributed filesystem because it's always local to the machine. It's usually named /disk/scratch but some nodes have /disk/scratch1 or /disk/scratch2 or /disk/scratch_big or /disk/scratch_fast. Check the computing.help page for your cluster to see what's available. Your job should copy its code and input data from the distributed filesystem to scratch space at the start, then copy its output data from scratch space to the distributed filesystem when it finishes. It should also then delete all of your files from scratch space, so that other people can use it after you. Scratch space is not backed up and it may be cleared periodically, so don't store files there and expect them to still be there months later.
  • AFS is where home directories (and group space) are housed on DICE machines. Most cluster nodes cannot access AFS. In a cluster, it can only be accessed from the cluster's head node. You can log in to the head node in order to copy code and data between AFS and the distributed filesystem - for instance so that you have backup copies. AFS is backed up nightly, so your files there can be retrieved in the event of system failure.
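
For example, once a job or interactive session is running on a node, you can check which scratch directories that node actually has (the exact names vary by cluster, as noted above):

    # List the local scratch directories on this node.
    ls -d /disk/scratch*

    # See how much space is free on one of them.
    df -h /disk/scratch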

Files - small batches please

When copying files from the distributed filesystem to scratch space (or vice versa), copy them in small batches (1000 files or fewer). In tests, copying 10000 small files took roughly 70 times as long as copying 1000 files, instead of the expected 10 times as long - a huge performance loss!

Whenever you tie up the distributed filesystem by trying to access large numbers of files, it becomes effectively unavailable to everyone else.

So when using the distributed filesystem, it's a good idea to access relatively few files at once. One way to do this might be to amalgamate your files using a utility such as zip, then copy the zip file to the distributed filesystem, then have your job copy that zip file to scratch space and unzip it. At the end of a job, output files could be zipped together, then the zip file copied to the distributed filesystem. You can then copy the zip file onwards to (for example) AFS, where you can unzip it.
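
As a rough sketch of that pattern (the username and paths below are just the examples used above):

    # Before submitting: bundle many small input files into a single archive
    # and put it on the distributed filesystem.
    zip -r inputs.zip inputs/
    cp inputs.zip /home/s2345678/

    # Inside the job: copy the one archive to scratch space and unpack it there.
    mkdir -p /disk/scratch/s2345678
    cp /home/s2345678/inputs.zip /disk/scratch/s2345678/
    cd /disk/scratch/s2345678 && unzip inputs.zip

    # At the end of the job: bundle the outputs and copy a single file back.
    zip -r outputs.zip outputs/
    cp outputs.zip /home/s2345678/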

Files - best practice for jobs

  1. Put the code into the distributed filesystem.
  2. Put the data used as the input for your model into the distributed filesystem.
  3. At the beginning of your sbatch job, copy the data used as the input for your model to the scratch disk of the node you have been assigned to.
    • You do this because reading from the distributed filesystem regularly is slow, and most of the models we run do read very regularly from the input data.
    • The data can be copied with rsync -u (see man rsync for details); a sketch of a complete job script following these steps is shown after this list.
  4. Save any outputs you need (such as model checkpoints) to the scratch disk of the node you have been assigned to. Jobs often need to write small files (such as logs) frequently, and doing that on the distributed filesystem is slow and can overload it.
  5. At the end of your sbatch job, copy the output data to the distributed filesystem.
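
Putting those steps together, a minimal sbatch job script might look like the sketch below. The per-user scratch directory, the paths and the training command are placeholders - adjust them for your cluster and your project. rsync is given -a as well as -u here so that directories are copied recursively.

    #!/bin/bash
    # Sketch of a job script following the steps above; all paths are illustrative.

    # Working directory in the local scratch space of the assigned node.
    SCRATCH=/disk/scratch/${USER}
    mkdir -p ${SCRATCH}/inputs ${SCRATCH}/outputs

    # Copy code and input data from the distributed filesystem to scratch space.
    rsync -au /home/${USER}/project/ ${SCRATCH}/project/
    rsync -au /home/${USER}/inputs/ ${SCRATCH}/inputs/

    # Run the model, reading from and writing to scratch space only.
    python ${SCRATCH}/project/train.py --data ${SCRATCH}/inputs --output ${SCRATCH}/outputs

    # Copy the outputs back to the distributed filesystem.
    rsync -au ${SCRATCH}/outputs/ /home/${USER}/outputs/

    # Finally, clean up scratch space (see "Files - deletion" below).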

Files - deletion

Please delete your files from scratch space as soon as you can - for example, at the end of each batch job, after the job has copied your output files to the distributed filesystem - to make room for others to use it.
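
For example, if your job used a working directory like /disk/scratch/${USER} (the path is only illustrative), the last step of the job script can remove it once the outputs are safely back on the distributed filesystem:

    # Remove your working directory from scratch space at the end of the job.
    rm -rf /disk/scratch/${USER}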

From time to time, files are deleted from scratch space if they haven't been accessed for a while, to make room for newer ones.
You may find that your files have been deleted from scratch space but your directories have been left intact.

Jobs - best practice

Please set resource limits for your jobs. The --mem and --time options to sbatch do this. For details type man sbatch.

Set the maximum memory to at least 1000MB less than the node's total memory divided by the number of GPUs in the node, and set the maximum time to a few hours. If you must set --time to a day or more, take care not to use more than your fair share of the cluster.
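
The limits can be given on the command line or as #SBATCH directives near the top of the job script; the values below are only illustrative, so pick ones that suit your job and your cluster's nodes.

    # On the command line...
    sbatch --mem=12000 --time=04:00:00 myjob.sh

    # ...or as directives near the top of the job script itself:
    #SBATCH --mem=12000
    #SBATCH --time=04:00:00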

Make your jobs short and fault tolerant. Cluster nodes can fail - for example memory or disk space can be exhausted, or hardware can fail. When they do, and your job fails as a result, you don't want to have to start your job again at the beginning.

Instead of one long job, have a series of short jobs, and have them write output regularly. That way, when a job fails, you won't have to go back very far.

For example, instead of writing one job which does 100 epochs, each of which takes an hour, write a job which just loads the previous epoch’s parameters, performs one epoch, then saves the parameters and exits. This way, a random failure costs you at most one epoch of work.
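
One way to chain such short jobs is with Slurm job dependencies. The sketch below assumes a hypothetical script train_one_epoch.sh which loads the latest checkpoint, trains one epoch, saves a new checkpoint and exits:

    # Submit 100 one-epoch jobs, each starting only after the previous one succeeds.
    JOBID=$(sbatch --parsable train_one_epoch.sh)
    for i in $(seq 2 100); do
        JOBID=$(sbatch --parsable --dependency=afterok:${JOBID} train_one_epoch.sh)
    done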

Interactive Jobs

When you have finished using an interactive session on a GPU server, please quit it, so that other people can have an interactive session. GPU nodes should never be left idle, because other people need to use them too.
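
For example, an interactive session is usually started with srun and finished simply by leaving the shell; the partition name and GPU request below are illustrative, so check your cluster's documentation for the right options.

    # Start an interactive shell on a GPU node.
    srun --partition=gpu --gres=gpu:1 --pty bash

    # ...work interactively...

    # When you are done, quit the session so the node is freed for others.
    exit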

Array Jobs

Use array jobs where you can. These make it much easier to control the number of concurrent jobs manually if required, and they're a much better, more explicit way of performing something like a grid search.

Here's an example of how to use array jobs, courtesy of James Owers (thanks James!):

  1. Create a text file (commands.txt) containing all the commands you want to run.
  2. In the bash script to pass to Slurm (slurm_array_template.sh) use $SLURM_ARRAY_TASK_ID to select the line of commands.txt - for example:
    COMMAND=$(sed "${SLURM_ARRAY_TASK_ID}q;d" "$1")

    where $1 is commands.txt (passed as an argument to the .sh script).

  3. Run the selected command with eval "$COMMAND" in slurm_array_template.sh (a complete template sketch is shown after this list).
  4. Submit the experiment to sbatch like this:
    EXPT_FILE=commands.txt
    NR_EXPTS=$(cat ${EXPT_FILE} | wc -l)
    MAX_PARALLEL_JOBS=30
    sbatch --array=1-${NR_EXPTS}%${MAX_PARALLEL_JOBS} slurm_array_template.sh $EXPT_FILE
    
  5. You can adjust the number of concurrent jobs executing in parallel using, for example:
    scontrol update ArrayTaskThrottle=16 JobId=475856
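
For completeness, a minimal slurm_array_template.sh along these lines might look like the sketch below (the #SBATCH values are placeholders):

    #!/bin/bash
    #SBATCH --mem=8000
    #SBATCH --time=02:00:00

    # $1 is the file of commands (e.g. commands.txt); pick out the line for this task.
    COMMAND=$(sed "${SLURM_ARRAY_TASK_ID}q;d" "$1")

    # Run the selected command.
    eval "$COMMAND"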
Last reviewed: 01/09/2019
