It's easy to use a cluster the wrong way, so here are some tips on how to do it right.
See also the user-provided documentation:
- ⇒ Beginners Guide.
- ⇒ Cluster Quickstart.
- ⇒ A collection of useful scripts, templates, and examples for clusters using SLURM.
- ⇒ Further templates for running experiments with slurm.
Please note these external links were accurate on Dec 2nd 2019. Many thanks to Christine and James for providing these.
Head Nodes
Each cluster has one or more head nodes. A head node acts as a gateway between the cluster and everywhere else - users log in to it to access the distributed filesystem and to submit jobs using the job scheduler (usually Slurm).
- ⇒ Slurm.
Please do not run jobs on the head node! That's not what it's for! Instead, submit jobs from the head node using the cluster's job scheduling software; the scheduler will then allocate each job a cluster node to run on. If you must run processes on the head node, please use the UNIX nice command so that the node stays responsive for other users.
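For example, instead of running a script directly on the head node, hand it to the scheduler; and if a small task really has to run on the head node itself, wrap it in nice (the script and file names here are just placeholders):
# Don't run this directly on the head node:
#   ./run_experiment.sh
# Instead, submit it so that it runs on a compute node:
sbatch run_experiment.sh

# For small unavoidable tasks on the head node, lower their priority with nice:
nice -n 19 tar czf results.tar.gz results/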
Files - places
If your jobs use files in the wrong location, they could run far more slowly than they need to, or stop everyone from using the cluster.
Here's where files can be stored:
- The distributed filesystem. Each user has a directory on the distributed filesystem, like /home/username (for example /home/s2345678). It's accessible from every node in the cluster, but it's not fast to use. The distributed filesystem is coordinated by a filesystem node. The files themselves are distributed all over the cluster (hence the name), but the filesystem node gives them the appearance of a single unified filespace. It's good for getting files to and from the computing nodes. It's a bad place for your computing jobs to use as working space for reading or writing data, because it's too slow, and because that can overload the filesystem node. Instead, it's best to copy your code and input data from the distributed filesystem to scratch space at the start of a job, then copy output data back from scratch space to the distributed filesystem at the end of the job.
- ⇒ The current distributed filesystem is Lustre (wikipedia).
- ⇒ Some users are still using the previous GlusterFS (wikipedia) filesystem.
- Scratch space is local disk space. Each node has its own local scratch space, and each node can only access its own scratch space. Scratch space is faster than the distributed filesystem because it's always local to the machine. It's usually named /disk/scratch, but some nodes have /disk/scratch1, /disk/scratch2, /disk/scratch_big or /disk/scratch_fast. Check the computing.help page for your cluster to see what's available. Your job should copy its code and input data from the distributed filesystem to scratch space at the start, then copy its output data from scratch space to the distributed filesystem when it finishes. It should then delete all of your files from scratch space, so that other people can use it after you. If you are running (or are likely to run) multiple jobs on the same node, it's acceptable to leave the data there between jobs rather than copying it across every time. Scratch space is not backed up and it may be cleared periodically, so don't store files there and expect them to still be there months later.
- AFS is where home directories (and group space) are housed on DICE machines. Most cluster nodes cannot access AFS; in a cluster it can only be accessed from the head node. You can log in to the head node to copy code and data between AFS and the distributed filesystem - for instance so that you have backup copies. AFS is backed up nightly, so your files there can be retrieved in the event of a systems failure.
Files - small batches please
When copying files from the distributed filesystem to scratch space (or vice versa), copy them in small batches (1,000 files or fewer). In tests, copying 10,000 small files took roughly 70 times as long as copying 1,000 files, instead of the expected 10 times as long - a huge performance loss!
Whenever you tie up the distributed filesystem by trying to access large numbers of files, it becomes effectively unavailable to anyone else.
So when using the distributed filesystem, it's a good idea to access relatively few files at once. One way to do this might be to amalgamate your files using a utility such as zip, then copy the zip file to the distributed filesystem, then have your job copy that zip file to scratch space and unzip it. At the end of a job, output files could be zipped together, then the zip file copied to the distributed filesystem. You can then copy the zip file onwards to (for example) AFS, where you can unzip it.
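Here's a rough sketch of that workflow, using the example username from above (all paths are placeholders):
# On the head node (or a DICE machine): bundle many small input files into one archive
zip -r inputs.zip inputs/
cp inputs.zip /home/s2345678/project/

# In the job, on the compute node: copy the single archive to scratch and unpack it there
mkdir -p /disk/scratch/s2345678
cp /home/s2345678/project/inputs.zip /disk/scratch/s2345678/
cd /disk/scratch/s2345678 && unzip -q inputs.zip

# At the end of the job: zip the outputs and copy one file back
zip -r outputs.zip outputs/
cp outputs.zip /home/s2345678/project/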
Files - best practice for jobs
- Put the code into the distributed filesystem.
- Put the data used as the input for your model into the distributed filesystem.
- At the beginning of your sbatch job, copy the data used as the input for your model to the scratch disk of the node you have been assigned to.
- Do this because frequent reads from the distributed filesystem are slow, and most of the models we run read their input data very frequently.
- The data can be copied with rsync -u (see man rsync for details).
- Save any outputs you need (such as model checkpoints) to the scratch disk of the node you have been assigned to. Jobs often write small files (such as logs) frequently, and those writes should go to scratch rather than to the distributed filesystem.
- At the end of your sbatch job, copy the output data back to the distributed filesystem (the whole pattern is sketched below).
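Putting those steps together, a job script might look roughly like this - a sketch only, in which the paths, resource limits and training command are placeholders rather than recommendations:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mem=14000
#SBATCH --time=04:00:00

SCRATCH=/disk/scratch/${USER}
mkdir -p ${SCRATCH}/data ${SCRATCH}/output

# 1. Copy the input data from the distributed filesystem to scratch (-u skips files
#    that are already up to date, so reruns on the same node are cheap)
rsync -u -r /home/${USER}/project/data/ ${SCRATCH}/data/

# 2. Run the model, reading from and writing to scratch (train.py is a placeholder)
python /home/${USER}/project/train.py --data ${SCRATCH}/data --output ${SCRATCH}/output

# 3. Copy the results back to the distributed filesystem
rsync -u -r ${SCRATCH}/output/ /home/${USER}/project/output/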
Files - deletion
Please delete your files from scratch space as soon as you can - for example, at the end of each batch job, after the job has copied your output files to the distributed filesystem - to make room for others to use it.
From time to time, files are deleted from scratch space if they haven't been accessed for a while, to make room for newer ones.
You may find that your files have been deleted from scratch space but your directories have been left intact.
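One way to make sure the clean-up happens even if the job fails part-way through is to register it with a trap near the top of the job script (a sketch; skip this if you deliberately keep data on scratch between jobs on the same node):
SCRATCH=/disk/scratch/${USER}/${SLURM_JOB_ID}
mkdir -p ${SCRATCH}

# Delete this job's scratch directory when the script exits,
# whether it succeeds or fails:
trap "rm -rf ${SCRATCH}" EXIT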
Jobs - best practice
Please set resource limits for your jobs, using the --mem and --time options to sbatch (type man sbatch for details).
Set the maximum memory to at least 1000MB less than the node's memory divided by the number of GPUs in the node, and set the maximum time to a few hours (if a --time of a day or more must be used, be conscious of not using too much resource).
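For example, on a node with 8 GPUs and 128000MB of memory, 128000/8 - 1000 = 15000, so you might submit a single-GPU job like this (myjob.sh is a placeholder):
sbatch --gres=gpu:1 --mem=15000 --time=04:00:00 myjob.sh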
Make your jobs short and fault tolerant. Cluster nodes can fail - for example memory or disk space can be exhausted, or hardware can fail. When they do, and your job fails as a result, you don't want to have to start your job again at the beginning.
Instead of one long job, have a series of short jobs, and have them write output regularly. That way, when a job fails, you won't have to go back very far.
For example, instead of writing one job which does 100 epochs, each of which takes an hour, write a job which just loads the previous epoch’s parameters, performs one epoch, then saves the parameters and exits. This way, random failure will not affect you much.
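As a sketch, a single-epoch job and a dependency chain to run it 100 times might look like this (train.py, its flags and the checkpoint layout are hypothetical - adapt them to your own code):
#!/bin/bash
# one_epoch.sh - run exactly one epoch, resuming from the newest checkpoint if one exists
LATEST=$(ls -t /home/${USER}/project/checkpoints/*.pt 2>/dev/null | head -n 1)
if [ -n "$LATEST" ]; then
    python train.py --resume "$LATEST" --epochs 1
else
    python train.py --epochs 1
fi

# Then, from the head node, submit 100 of these, each one starting
# only after the previous one finishes successfully:
jobid=$(sbatch --parsable one_epoch.sh)
for i in $(seq 2 100); do
    jobid=$(sbatch --parsable --dependency=afterok:${jobid} one_epoch.sh)
done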
Interactive Jobs
When you have finished using an interactive session on a GPU server, please quit it, so that other people can have an interactive session. GPU nodes should never be left idle, because other people need to use them too.
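One common pattern is to start the session with srun and exit the shell as soon as you're done (the partition and gres here are examples - use ones that exist on your cluster):
srun --partition=Teach-Interactive --gres=gpu:gtx-1060:1 --pty bash
# ... do your interactive work ...
exit    # ends the session and releases the GPU for someone else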
Array Jobs
Use array jobs where you can. These make it much easier to control the number of concurrent jobs manually if required, and they're a much better, more explicit way of performing something like a grid search.
Here's an example of how to use array jobs, courtesy of James Owers (thanks James!):
- Create a text file (commands.txt) containing all the commands you want to run, one command per line.
- In the bash script to pass to slurm (slurm_array_template.sh), use $SLURM_ARRAY_TASK_ID to select a line of commands.txt - for example:
COMMAND="`sed \"${SLURM_ARRAY_TASK_ID}q;d\" $1`"
where $1 is the commands file (commands.txt, passed as an argument to the .sh script), then run the selected command with eval "$COMMAND" in slurm_array_template.sh. A sketch of such a script follows this list.
- Submit the experiment to sbatch like this:
EXPT_FILE=commands.txt
NR_EXPTS=`cat ${EXPT_FILE} | wc -l`
MAX_PARALLEL_JOBS=30
sbatch --array=1-${NR_EXPTS}%${MAX_PARALLEL_JOBS} slurm_array_template.sh $EXPT_FILE
- You can adjust the number of concurrent jobs executing in parallel using, for example:
scontrol update ArrayTaskThrottle=16 JobId=475856
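For reference, a minimal slurm_array_template.sh along those lines might look like this (a sketch; resource limits and scratch handling are omitted for brevity):
#!/bin/bash
#SBATCH --mem=8000
#SBATCH --time=02:00:00

# $1 is the commands file (e.g. commands.txt) passed by sbatch, as shown above.
# Pick out the line corresponding to this array task and run it.
COMMAND="`sed \"${SLURM_ARRAY_TASK_ID}q;d\" $1`"
echo "Running: ${COMMAND}"
eval "${COMMAND}"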
GPUs - listing GPU types
Different clusters and even different nodes in a cluster can have different specifications of GPU. You can list the type and number of each GPU available, and you can select specific types of GPU when submitting a job.
[uhtred]bsmith5: sinfo -N -O partition,nodelist:14,gres:60,cpus:10,memory:10
PARTITION           NODELIST      GRES                            CPUS      MEMORY
Teach-Interactive   landonia01    gpu:titan-x:3,gpu:gtx-1060:2    12        96000
[...]
Teach-LongJobs      landonia21    gpu:gtx-1060:8                  12        96000
Teach-LongJobs      landonia22    gpu:gtx-1060:8                  12        96000
Teach-LongJobs      landonia23    gpu:gtx-1060:8                  12        96000
Teach-LongJobs      landonia24    gpu:gtx-1060:8                  12        96000
Teach-LongJobs      landonia25    gpu:gtx-1060:8                  12        96000
General_Usage       letha06       gpu:gtx-1080:8                  40        128000
General_Usage       meme          gpu:titan-x:4                   32        128000
This shows the GPUs, CPUs and CPU memory (megabytes) available to jobs on each node. You can see nodes that have different types of GPU and nodes that have multiple types of GPU themselves. You can choose a particular GPU type by modifying your gres option: --gres=gpu:titan-x:1
Here's one way to check the memory capacity of a GPU you're planning to use:
srun --partition=General_Usage --gres=gpu:titan-x:1 nvidia-smi --query-gpu=name,memory.total --format=csv
Alternatively, you can look up the GPUs in a list of GPU specifications [wikipedia.org].
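The same gres selection works inside a batch script as an #SBATCH directive (the partition, GPU type and count here are examples taken from the listing above):
#SBATCH --partition=General_Usage
#SBATCH --gres=gpu:titan-x:1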
Summary of GPU memory
gres | Name | Memory Total (MiB) |
a100_80gb | NVIDIA A100 80GB PCIe | 81920 |
a40 | NVIDIA A40 | 46068 |
a6000 | NVIDIA RTX A6000 | 49140 |
gtx-1060 | NVIDIA GeForce GTX 1060 6GB | 6144 |
gtx_1080_ti | NVIDIA GeForce GTX 1080 Ti | 11264 |
gtx-2080ti, rtx_2080_ti | NVIDIA GeForce RTX 2080 Ti | 11264 |
titan-x, titan_x | NVIDIA GeForce GTX TITAN X | 12288 |
titan_xp | NVIDIA TITAN Xp | 12288 |
titan-x-pascal,titan_x_pascal | NVIDIA TITAN X (Pascal) | 12288 |