Some of the School's GPU compute clusters use the Slurm job scheduler.

Slurm matches computing jobs with computing resources, aiming to allocate resources fairly and use them efficiently. To achieve this it has complex prioritisation rules.
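
If you want to see how these rules rank waiting jobs, Slurm's sprio command shows the priority components of pending jobs (it only lists jobs that are still waiting to be scheduled), e.g.:

sprio -l -u $USER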

How to use Slurm

Slurm is widely used on supercomputers, so there are plenty of guides explaining how to use it, including advice on how to use a cluster without breaking it. Some local examples follow.

A job can be either interactive (you get a shell prompt) or batch (it runs a list of commands in a shell script).
In Slurm, the nodes in a compute cluster are grouped into partitions, and a job must be submitted to a partition.

To see what partitions are available

Use sinfo. First log in to a cluster's head node, e.g. ssh mlp, then:

[escience5]iainr: sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
Interactive    up    2:00:00      2   idle landonia[01,25]
Standard*      up    8:00:00      2    mix landonia[04,11]
Standard*      up    8:00:00     10   idle landonia[13-17,20-24]
Short          up    4:00:00      1    mix landonia18
Short          up    4:00:00      1   idle landonia02
LongJobs       up 3-08:00:00      1  drain landonia10
LongJobs       up 3-08:00:00      5    mix landonia[03-04,11,18-19]
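
In this output, an idle node is free, a mix node has some of its resources allocated, and a node in the drain state is not accepting new jobs. To look at a single partition, sinfo takes -p, for example:

sinfo -p LongJobs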

To run a job

Use srun (for an interactive job) or sbatch (for a batch job).
Note that you should specify the partition you wish to use and, if you want to use GPUs, how many GPUs you want. If you don't, your job will run in the Standard partition and will not be allocated any GPUs.
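
For example (a sketch using the partition names from the sinfo output above; myjob.sh stands in for your own script):

srun --partition=Interactive --gres=gpu:1 --pty bash
sbatch --partition=LongJobs --gres=gpu:2 myjob.sh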

To run an interactive job

[escience6]iainr: srun --gres=gpu:1 --pty bash
[charles17]iainr: nvidia-smi
Thu Jun 14 08:45:12 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   25C    P8     9W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[charles17]iainr: exit
[escience6]iainr: 
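
Interactive jobs are still bound by the partition time limit (2:00:00 on Interactive in the sinfo output above). You can request less with --time, e.g.:

srun --partition=Interactive --gres=gpu:1 --time=00:30:00 --pty bash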

To submit a batch job

Assuming that test2.sh is a shell script:

[escience5]iainr: ls test2.sh
test2.sh
[escience5]iainr: cat test2.sh
#!/bin/sh

/bin/hostname
/usr/bin/who
/usr/bin/nvidia-smi
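
The same options can be embedded in the script itself as #SBATCH directives, so they don't have to be repeated on the command line. A minimal sketch (the partition and time limit here are illustrative):

#!/bin/sh
#SBATCH --partition=Standard
#SBATCH --gres=gpu:2
#SBATCH --time=00:10:00

/bin/hostname
/usr/bin/nvidia-smi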

Submit the job using sbatch, requesting two GPUs as a generic resource (--gres=gpu:2). In this example we run squeue immediately after the sbatch command, because otherwise the job would be scheduled and finished before we could type squeue.

[escience5]iainr: sbatch  --gres=gpu:2 test2.sh ; squeue
Submitted batch job 127096
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            126488  Standard   run.sh s1302760 PD       0:00      1 (PartitionTimeLimit)
            127096  Standard test2.sh    iainr PD       0:00      1 (None)
            126716  LongJobs   run.sh s1302760  R 1-19:24:23      1 landonia04
[escience5]iainr: ls *.out
slurm-127096.out
[escience5]iainr: cat slurm-127096.out
landonia05.inf.ed.ac.uk
Fri Jun 15 09:38:10 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:04:00.0 Off |                  N/A |
| 24%   33C    P0    26W / 120W |      0MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:09:00.0 Off |                  N/A |
| 24%   33C    P0    28W / 120W |      0MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[escience5]iainr: 
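
To check on your jobs later, squeue -u $USER lists only your own jobs, and scancel cancels one by its job id, e.g.:

squeue -u $USER
scancel 127096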

Last reviewed: 03/05/2023