Some of the School's GPU compute clusters use the Slurm job scheduler.
Slurm matches computing jobs with computing resources. It tries to ensure that resources are allocated fairly and used efficiently, and to do this it applies complex prioritisation rules.
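The priority of jobs waiting in the queue, and the factors contributing to it, can be inspected with the standard sprio command, e.g.:

[escience5]iainr: sprio -l    # per-job priority factors for pending jobs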
How to use Slurm
Slurm is widely used on supercomputers, so there are lots of guides which explain how to use it:
- ⇒ The Slurm Quick Start User Guide.
- ⇒ Slurm Quick Start Tutorial (CÉCI).
- ⇒ Slurm: basics, gathering information, creating a job, script examples (OzSTAR @ Swinburne University of Technology).
Here's how to use a cluster without breaking it:
- ⇒ cluster tips.
Here are some local examples.
A job can either be interactive (you get a shell prompt) or batch (it runs a list of commands in a shell script).
With Slurm, the nodes in a compute cluster are grouped in partitions. A job must be submitted to a partition.
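For example, a job can be directed to a named partition with the --partition (or -p) option to srun or sbatch; the partition name used here is one of those shown in the sinfo output in the next section:

[escience5]iainr: srun -p Interactive --pty bash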
To see what partitions are available
Use sinfo. First log in to a cluster's head node, e.g. ssh mlp, then:
[escience5]iainr: sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
Interactive  up     2:00:00         2  idle   landonia[01,25]
Standard*    up     8:00:00         2  mix    landonia[04,11]
Standard*    up     8:00:00        10  idle   landonia[13-17,20-24]
Short        up     4:00:00         1  mix    landonia18
Short        up     4:00:00         1  idle   landonia02
LongJobs     up     3-08:00:00      1  drain  landonia10
LongJobs     up     3-08:00:00      5  mix    landonia[03-04,11,18-19]
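sinfo also has further standard options that are useful here; for instance, -N -l gives a long, per-node listing, and -p restricts the output to a single partition:

[escience5]iainr: sinfo -N -l           # one line per node, with CPU and memory details
[escience5]iainr: sinfo -p Interactive  # show only the Interactive partition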
To run a job
Use srun (for an interactive job) or sbatch (for a batch job).
Note that you must specify the partition you wish to use and, if you want to use GPUs, how many GPUs you want. By default your jobs will run in the Standard partition and will not be allocated any GPUs.
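As a sketch, a batch submission that names a partition from the sinfo output above and requests two GPUs would look like this (myjob.sh is a placeholder for your own batch script):

[escience5]iainr: sbatch --partition=LongJobs --gres=gpu:2 myjob.sh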
To run an interactive job
[escience6]iainr: srun --gres=gpu:1 --pty bash
[charles17]iainr: nvidia-smi
Thu Jun 14 08:45:12 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48 Driver Version: 390.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:02:00.0 Off | N/A |
| 23% 25C P8 9W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[charles17]iainr: exit
[escience6]iainr:
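srun also accepts the usual Slurm resource options, so the interactive session above can be pinned down more precisely. The values here are illustrative, with the time limit chosen to fit inside the Interactive partition's 2:00:00 maximum (--mem is in megabytes):

[escience6]iainr: srun --partition=Interactive --gres=gpu:1 --time=01:00:00 --mem=8000 --pty bash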
To submit a batch job
Assuming that test2.sh is a shell script:
[escience5]iainr: ls test2.sh
test2.sh
[escience5]iainr: cat test2.sh
#!/bin/sh
/bin/hostname
/usr/bin/who
/usr/bin/nvidia-smi
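Job options can also be embedded in the script itself as #SBATCH directives, so that they do not have to be repeated on the command line. A minimal sketch, with illustrative values (%j is Slurm's placeholder for the job ID):

#!/bin/sh
#SBATCH --partition=Standard   # partition to submit to
#SBATCH --gres=gpu:2           # number of GPUs to request
#SBATCH --time=02:00:00        # wall-clock time limit
#SBATCH --output=slurm-%j.out  # file for stdout/stderr (%j = job ID)
/bin/hostname
/usr/bin/who
/usr/bin/nvidia-smi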
Submit the job using sbatch, requesting 2 GPUs as requestable (generic) resources. In this example we run squeue on the same command line as sbatch, because otherwise the job would have been scheduled and run before we could type the squeue command.
[escience5]iainr: sbatch --gres=gpu:2 test2.sh ; squeue
Submitted batch job 127096
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
126488 Standard run.sh s1302760 PD 0:00 1 (PartitionTimeLimit)
127096 Standard test2.sh iainr PD 0:00 1 (None)
126716 LongJobs run.sh s1302760 R 1-19:24:23 1 landonia04
[escience5]iainr: ls *.out
slurm-127096.out
[escience5]iainr: cat slurm-127096.out
landonia05.inf.ed.ac.uk
Fri Jun 15 09:38:10 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48 Driver Version: 390.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... Off | 00000000:04:00.0 Off | N/A |
| 24% 33C P0 26W / 120W | 0MiB / 6078MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 106... Off | 00000000:09:00.0 Off | N/A |
| 24% 33C P0 28W / 120W | 0MiB / 6078MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[escience5]iainr:
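Once a job has been submitted, the standard Slurm commands can be used to monitor or cancel it; for example, using the job ID from the run above:

[escience5]iainr: squeue -u iainr   # list only your own jobs
[escience5]iainr: scancel 127096    # cancel a job by its job ID
[escience5]iainr: sacct -j 127096   # accounting record for a finished job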