Some of the School's GPU compute clusters use the Slurm job scheduler.
Slurm matches computing jobs with computing resources. It tries to ensure that resources are allocated fairly and used efficiently; to achieve this it applies complex prioritisation rules.
How to use Slurm
Slurm is widely used on supercomputers, so there are lots of guides which explain how to use it:
- ⇒ The Slurm Quick Start User Guide.
- ⇒ Slurm examples (HPC @ Uni.lu).
- ⇒ Slurm Quick Start Tutorial (CÉCI).
- ⇒ Slurm: basics, gathering information, creating a job, script examples (OzStar @ Swinburne U of T).
Here's how to use a cluster without breaking it:
- ⇒ cluster tips.
Here are some local examples.
A job can either be interactive (you get a shell prompt) or batch (it runs a list of commands in a shell script).
With Slurm, the nodes in a compute cluster are grouped in partitions. A job must be submitted to a partition.
To see what partitions are available
Use sinfo. First log in to a cluster's head node, then:
[escience5]iainr: sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
Interactive  up       2:00:00       2  idle   landonia[01,25]
Standard*    up       8:00:00       2  mix    landonia[04,11]
Standard*    up       8:00:00      10  idle   landonia[13-17,20-24]
Short        up       4:00:00       1  mix    landonia18
Short        up       4:00:00       1  idle   landonia02
LongJobs     up    3-08:00:00       1  drain  landonia10
LongJobs     up    3-08:00:00       5  mix    landonia[03-04,11,18-19]
To run a job
Use srun (for an interactive job) or sbatch (for a batch job).
Note that you must specify the partition you wish to use and, if you want to use GPUs, how many GPUs you want. By default your jobs will run in the Standard partition (marked with * in the sinfo output) and you will not get any GPUs.
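As a sketch of how these options fit together, the snippet below builds a typical srun command line. The partition names ("Interactive", "LongJobs") are taken from the sinfo output above; substitute whatever sinfo reports on your cluster.

```shell
# Choose a partition and GPU count explicitly rather than relying on
# the defaults (Standard partition, no GPUs).
partition=Interactive
ngpus=1
cmd="srun --partition=${partition} --gres=gpu:${ngpus} --pty bash"
echo "$cmd"
# On the head node you would run commands of this shape, e.g.:
#   srun --partition=Interactive --gres=gpu:1 --pty bash
#   sbatch --partition=LongJobs --gres=gpu:2 test2.sh
```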
To run an interactive job
[escience6]iainr: srun --gres=gpu:1 --pty bash
[charles17]iainr: nvidia-smi
Thu Jun 14 08:45:12 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   25C    P8     9W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[charles17]iainr: exit
[escience6]iainr:
To submit a batch job
Assuming that test2.sh is the following shell script:
[escience5]iainr: ls test2.sh
test2.sh
[escience5]iainr: cat test2.sh
#!/bin/sh
/bin/hostname
/usr/bin/who
/usr/bin/nvidia-smi
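Instead of passing options such as --gres on the sbatch command line, you can embed them in the script itself as #SBATCH directives, which sbatch reads from the top of the script. The sketch below writes such a script; the partition, time limit, and output pattern shown are illustrative choices, not site requirements.

```shell
# Write a variant of test2.sh with embedded #SBATCH directives.
# (test3.sh is a hypothetical name; the directive values are examples.)
cat > test3.sh <<'EOF'
#!/bin/sh
#SBATCH --partition=LongJobs
#SBATCH --gres=gpu:2
#SBATCH --time=08:00:00
#SBATCH --output=slurm-%j.out
/bin/hostname
/usr/bin/who
/usr/bin/nvidia-smi
EOF
grep -c '^#SBATCH' test3.sh   # 4
```

With the options in the script, the job can be submitted with a plain `sbatch test3.sh`, and the resource request travels with the script rather than having to be retyped each time.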
Submit the job using sbatch, requesting two GPUs as requestable resources (--gres=gpu:2). In this example we run squeue on the same command line as sbatch, because otherwise the job would have been scheduled and run before we could type the squeue command.
[escience5]iainr: sbatch --gres=gpu:2 test2.sh ; squeue
Submitted batch job 127096
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 126488  Standard   run.sh s1302760 PD       0:00      1 (PartitionTimeLimit)
 127096  Standard test2.sh    iainr PD       0:00      1 (None)
 126716  LongJobs   run.sh s1302760  R 1-19:24:23      1 landonia04
 126745  LongJobs   run.sh s1302760  R 1-18:36:28      1 landonia03
 126751  LongJobs   run.sh s1302760  R 1-18:14:14      1 landonia03
 126769  LongJobs   run.sh s1302760  R 1-17:44:40      1 landonia18
 126770  LongJobs   run.sh s1302760  R 1-17:44:08      1 landonia18
 126851  LongJobs run_ted_ s1723861  R 1-12:23:27      1 landonia04
 126852  LongJobs run_ted_ s1723861  R 1-12:21:41      1 landonia04
 126998  LongJobs CNN_BALD s1718004  R   16:03:31      1 landonia11
 127000  LongJobs CNN_BALD s1718004  R   16:02:59      1 landonia03
 127002  LongJobs CNN_BALD s1718004  R   16:02:53      1 landonia04
 127026  LongJobs CNN_Kcen s1718004  R   11:59:02      1 landonia18
 127027  LongJobs CNN_Kcen s1718004  R   11:58:55      1 landonia18
 127028  LongJobs CNN_Kcen s1718004  R   11:58:51      1 landonia18
 127030  LongJobs run-epoc s1739461  R   11:46:35      1 landonia11
 127039  LongJobs run-epoc s1739461  R   11:16:33      1 landonia03
 127054  LongJobs run-epoc s1739461  R   10:46:32      1 landonia03
 127087  LongJobs run-epoc s1739461  R    8:56:28      1 landonia18
 127088  LongJobs run-epoc s1739461  R    8:46:27      1 landonia19
 127089  LongJobs run-epoc s1739461  R    8:36:26      1 landonia03
 127090  LongJobs run-epoc s1739461  R    8:26:26      1 landonia19
 127091  LongJobs run-epoc s1739461  R    8:16:25      1 landonia19
 127092  LongJobs run-epoc s1739461  R    8:06:25      1 landonia03
 127093  LongJobs run-epoc s1739461  R    8:06:25      1 landonia19
 127094  LongJobs run-epoc s1739461  R    7:46:24      1 landonia19
[escience5]iainr: ls *.out
slurm-127096.out
[escience5]iainr: cat slurm-127096.out
landonia05.inf.ed.ac.uk
Fri Jun 15 09:38:10 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:04:00.0 Off |                  N/A |
| 24%   33C    P0    26W / 120W |       0MiB / 6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:09:00.0 Off |                  N/A |
| 24%   33C    P0    28W / 120W |       0MiB / 6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[escience5]iainr:
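As the transcript shows, sbatch writes the job's output to a file named slurm-<jobid>.out in the directory you submitted from. The sketch below reconstructs that filename for the job above, with follow-up commands (which need a live Slurm cluster) noted in comments.

```shell
# sbatch's default output file is slurm-<jobid>.out, as seen above
# for batch job 127096:
jobid=127096
outfile="slurm-${jobid}.out"
echo "$outfile"            # slurm-127096.out

# Useful follow-up commands on the cluster:
#   squeue -u "$USER"      # show only your own jobs
#   scancel "$jobid"       # cancel a queued or running job
```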