It's easy to use a cluster the wrong way, so here are some tips on how to do it right.
See also the user-provided documentation:
- ⇒ Beginners Guide.
- ⇒ Cluster Quickstart.
- ⇒ A collection of useful scripts, templates, and examples for clusters using SLURM.
- ⇒ Further templates for running experiments with slurm.
Please note these external links were accurate on Dec 2nd 2019. Many thanks to Christine and James for providing these.
Head Nodes
Each cluster has one or more head nodes. A head node acts as a gateway between the cluster and everywhere else - users log in to it to access the distributed filesystem and to submit jobs using the job scheduler (usually Slurm).
- ⇒ Slurm.
Please do not run jobs on the head node! That's not what it's for! Instead, submit jobs from the head node using the cluster's job scheduling software; the scheduler will then allocate each job a cluster node to run on. If you must run processes on the head node, please use the UNIX nice command so that the node stays responsive for other users.
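For example, instead of running a script directly on the head node, hand it to the scheduler; and if a small task really has to run on the head node itself, wrap it in nice (the script and file names here are just placeholders):
# Don't run this directly on the head node:
#   ./run_experiment.sh
# Instead, submit it so that it runs on a compute node:
sbatch run_experiment.sh

# For small unavoidable tasks on the head node, lower their priority with nice:
nice -n 19 tar czf results.tar.gz results/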
Files - places
If your jobs use files in the wrong location, they could run far more slowly than they need to, or stop everyone from using the cluster.
Here's where files can be stored:
- The distributed filesystem. Each user has a directory on the distributed filesystem, like /home/username (for example /home/s2345678). It's accessible from every node in the cluster, but it's not fast to use. The distributed filesystem is coordinated by a filesystem node. The files themselves are distributed all over the cluster (hence the name), but the filesystem node gives them the appearance of a single unified filespace. It's good for getting files to and from the computing nodes. It's a bad place for your computing jobs to use as working space for reading or writing data, because it's too slow, and because that can overload the filesystem node. Instead, it's best to copy your code and input data from the distributed filesystem to scratch space at the start of a job, then copy output data back from scratch space to the distributed filesystem at the end of the job.
- ⇒ The current distributed filesystem is Lustre (wikipedia).
- ⇒ Some users are still using the previous GlusterFS (wikipedia) filesystem.
- Scratch space is local disk space. Each node has its own local scratch space, and each node can only access its own scratch space. Scratch space is faster than the distributed filesystem because it's always local to the machine. It's usually named /disk/scratch, but some nodes have /disk/scratch1, /disk/scratch2, /disk/scratch_big or /disk/scratch_fast. Check the computing.help page for your cluster to see what's available. Your job should copy its code and input data from the distributed filesystem to scratch space at the start, then copy its output data from scratch space to the distributed filesystem when it finishes. It should then delete all of your files from scratch space, so that other people can use it after you. If you are running (or are likely to run) multiple jobs on the same node, it's acceptable to leave the data there between jobs rather than copying it across every time. Scratch space is not backed up and it may be cleared periodically, so don't store files there and expect them to still be there months later.
- AFS is where home directories (and group space) are housed on DICE machines. Most cluster nodes cannot access AFS; in a cluster it can only be accessed from the head node. You can log in to the head node to copy code and data between AFS and the distributed filesystem - for instance so that you have backup copies. AFS is backed up nightly, so your files there can be retrieved in the event of a systems failure.
Files - small batches please
When copying files from the distributed filesystem to scratch space (or vice versa), copy them in small batches (1,000 files or fewer). In tests, copying 10,000 small files took roughly 70 times as long as copying 1,000 files, instead of the expected 10 times as long - a huge performance loss!
Whenever you tie up the distributed filesystem by trying to access large numbers of files, it becomes effectively unavailable to anyone else.
So when using the distributed filesystem, it's a good idea to access relatively few files at once. One way to do this might be to amalgamate your files using a utility such as zip, then copy the zip file to the distributed filesystem, then have your job copy that zip file to scratch space and unzip it. At the end of a job, output files could be zipped together, then the zip file copied to the distributed filesystem. You can then copy the zip file onwards to (for example) AFS, where you can unzip it.
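Here's a rough sketch of that workflow, using the example username from above (all paths are placeholders):
# On the head node (or a DICE machine): bundle many small input files into one archive
zip -r inputs.zip inputs/
cp inputs.zip /home/s2345678/project/

# In the job, on the compute node: copy the single archive to scratch and unpack it there
mkdir -p /disk/scratch/s2345678
cp /home/s2345678/project/inputs.zip /disk/scratch/s2345678/
cd /disk/scratch/s2345678 && unzip -q inputs.zip

# At the end of the job: zip the outputs and copy one file back
zip -r outputs.zip outputs/
cp outputs.zip /home/s2345678/project/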
Files - best practice for jobs
- Put the code into the distributed filesystem.
- Put the data used as the input for your model into the distributed filesystem.
- At the beginning of your sbatch job, copy the data used as the input for your model to the scratch disk of the node you have been assigned to.
- Do this because frequent reads from the distributed filesystem are slow, and most of the models we run read their input data very frequently.
- The data can be copied with rsync -u (see man rsync for details).
- Save any outputs you need (such as model checkpoints) to the scratch disk of the node you have been assigned to. Jobs often write small files (such as logs) frequently, and those writes should go to scratch rather than to the distributed filesystem.
- At the end of your sbatch job, copy the output data back to the distributed filesystem (the whole pattern is sketched below).
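Putting those steps together, a job script might look roughly like this - a sketch only, in which the paths, resource limits and training command are placeholders rather than recommendations:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mem=14000
#SBATCH --time=04:00:00

SCRATCH=/disk/scratch/${USER}
mkdir -p ${SCRATCH}/data ${SCRATCH}/output

# 1. Copy the input data from the distributed filesystem to scratch (-u skips files
#    that are already up to date, so reruns on the same node are cheap)
rsync -u -r /home/${USER}/project/data/ ${SCRATCH}/data/

# 2. Run the model, reading from and writing to scratch (train.py is a placeholder)
python /home/${USER}/project/train.py --data ${SCRATCH}/data --output ${SCRATCH}/output

# 3. Copy the results back to the distributed filesystem
rsync -u -r ${SCRATCH}/output/ /home/${USER}/project/output/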
Files - deletion
Please delete your files from scratch space as soon as you can - for example, at the end of each batch job, after the job has copied your output files to the distributed filesystem - to make room for others to use it.
From time to time, files are deleted from scratch space if they haven't been accessed for a while, to make room for newer ones.
You may find that your files have been deleted from scratch space but your directories have been left intact.
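One way to make sure the clean-up happens even if the job fails part-way through is to register it with a trap near the top of the job script (a sketch; skip this if you deliberately keep data on scratch between jobs on the same node):
SCRATCH=/disk/scratch/${USER}/${SLURM_JOB_ID}
mkdir -p ${SCRATCH}

# Delete this job's scratch directory when the script exits,
# whether it succeeds or fails:
trap "rm -rf ${SCRATCH}" EXIT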
Jobs - best practice
Please set resource limits for your jobs, using the --mem and --time options to sbatch (type man sbatch for details).
Set the maximum memory to at least 1000MB less than the node's memory divided by the number of GPUs in the node, and set the maximum time to a few hours (if a --time of a day or more must be used, be conscious of not using too much resource).
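For example, on a node with 8 GPUs and 128000MB of memory, 128000/8 - 1000 = 15000, so you might submit a single-GPU job like this (myjob.sh is a placeholder):
sbatch --gres=gpu:1 --mem=15000 --time=04:00:00 myjob.sh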
Make your jobs short and fault tolerant. Cluster nodes can fail - for example memory or disk space can be exhausted, or hardware can fail. When they do, and your job fails as a result, you don't want to have to start your job again at the beginning.
Instead of one long job, have a series of short jobs, and have them write output regularly. That way, when a job fails, you won't have to go back very far.
For example, instead of writing one job which does 100 epochs, each of which takes an hour, write a job which just loads the previous epoch’s parameters, performs one epoch, then saves the parameters and exits. This way, random failure will not affect you much.
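As a sketch, a single-epoch job and a dependency chain to run it 100 times might look like this (train.py, its flags and the checkpoint layout are hypothetical - adapt them to your own code):
#!/bin/bash
# one_epoch.sh - run exactly one epoch, resuming from the newest checkpoint if one exists
LATEST=$(ls -t /home/${USER}/project/checkpoints/*.pt 2>/dev/null | head -n 1)
if [ -n "$LATEST" ]; then
    python train.py --resume "$LATEST" --epochs 1
else
    python train.py --epochs 1
fi

# Then, from the head node, submit 100 of these, each one starting
# only after the previous one finishes successfully:
jobid=$(sbatch --parsable one_epoch.sh)
for i in $(seq 2 100); do
    jobid=$(sbatch --parsable --dependency=afterok:${jobid} one_epoch.sh)
done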
Interactive Jobs
When you have finished using an interactive session on a GPU server, please quit it, so that other people can have an interactive session. GPU nodes should never be left idle, because other people need to use them too.
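One common pattern is to start the session with srun and exit the shell as soon as you're done (the partition and gres here are examples - use ones that exist on your cluster):
srun --partition=Teach-Interactive --gres=gpu:gtx-1060:1 --pty bash
# ... do your interactive work ...
exit    # ends the session and releases the GPU for someone else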
Array Jobs
Use array jobs where you can. These make it much easier to control the number of concurrent jobs manually if required, and they're a much better, more explicit way of performing something like a grid search.
Here's an example of how to use array jobs, courtesy of James Owers (thanks James!):
- Create a text file (commands.txt) containing all the commands you want to run, one command per line.
- In the bash script to pass to slurm (slurm_array_template.sh), use $SLURM_ARRAY_TASK_ID to select a line of commands.txt - for example:
COMMAND="`sed \"${SLURM_ARRAY_TASK_ID}q;d\" $1`"
where $1 is the commands file (commands.txt, passed as an argument to the .sh script), then run the selected command with eval "$COMMAND" in slurm_array_template.sh. A sketch of such a script follows this list.
- Submit the experiment to sbatch like this:
EXPT_FILE=commands.txt
NR_EXPTS=`cat ${EXPT_FILE} | wc -l`
MAX_PARALLEL_JOBS=30
sbatch --array=1-${NR_EXPTS}%${MAX_PARALLEL_JOBS} slurm_array_template.sh $EXPT_FILE
- You can adjust the number of concurrent jobs executing in parallel using, for example:
scontrol update ArrayTaskThrottle=16 JobId=475856
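For reference, a minimal slurm_array_template.sh along those lines might look like this (a sketch; resource limits and scratch handling are omitted for brevity):
#!/bin/bash
#SBATCH --mem=8000
#SBATCH --time=02:00:00

# $1 is the commands file (e.g. commands.txt) passed by sbatch, as shown above.
# Pick out the line corresponding to this array task and run it.
COMMAND="`sed \"${SLURM_ARRAY_TASK_ID}q;d\" $1`"
echo "Running: ${COMMAND}"
eval "${COMMAND}"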
GPUs - listing GPU types
Different clusters and even different nodes in a cluster can have different specifications of GPU. You can list the type and number of each GPU available, and you can select specific types of GPU when submitting a job.
[uhtred]bsmith5: sinfo -N -O partition,nodelist:14,gres:60,cpus:10,memory:10
PARTITION           NODELIST      GRES                            CPUS      MEMORY
Teach-Interactive   landonia01    gpu:titan-x:3,gpu:gtx-1060:2    12        96000
[...]
Teach-LongJobs      landonia21    gpu:gtx-1060:8                  12        96000
Teach-LongJobs      landonia22    gpu:gtx-1060:8                  12        96000
Teach-LongJobs      landonia23    gpu:gtx-1060:8                  12        96000
Teach-LongJobs      landonia24    gpu:gtx-1060:8                  12        96000
Teach-LongJobs      landonia25    gpu:gtx-1060:8                  12        96000
General_Usage       letha06       gpu:gtx-1080:8                  40        128000
General_Usage       meme          gpu:titan-x:4                   32        128000
This shows the GPUs, CPUs and CPU memory (megabytes) available to jobs on each node. You can see nodes that have different types of GPU and nodes that have multiple types of GPU themselves. You can choose a particular GPU type by modifying your gres option: --gres=gpu:titan-x:1
Here's one way to check the memory capacity of a GPU you're planning to use:
srun --partition=General_Usage --gres=gpu:titan-x:1 nvidia-smi --query-gpu=name,memory.total --format=csv
Alternatively, you can look up the GPUs in a list of GPU specifications [wikipedia.org].
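The same gres selection works inside a batch script as an #SBATCH directive (the partition, GPU type and count here are examples taken from the listing above):
#SBATCH --partition=General_Usage
#SBATCH --gres=gpu:titan-x:1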
Summary of GPU memory
gres | Name | Memory Total (MiB) |
a100_80gb | NVIDIA A100 80GB PCIe | 81920 |
a40 | NVIDIA A40 | 46068 |
a6000 | NVIDIA RTX A6000 | 49140 |
gtx-1060 | NVIDIA GeForce GTX 1060 6GB | 6144 |
gtx_1080_ti | NVIDIA GeForce GTX 1080 Ti | 11264 |
gtx-2080ti, rtx_2080_ti | NVIDIA GeForce RTX 2080 Ti | 11264 |
titan-x, titan_x | NVIDIA GeForce GTX TITAN X | 12288 |
titan_xp | NVIDIA TITAN Xp | 12288 |
titan-x-pascal,titan_x_pascal | NVIDIA TITAN X (Pascal) | 12288 |