James and Charles cluster

The James and Charles cluster is available for use by PPAR and DS CDT students.

Admin mailing list

To receive administrative emails about the cluster, please sign up to the cdtcluster mailing list. It is a low-traffic list containing information about downtime and other notices.

Nodes

The cluster is made up of the following nodes:

GPU Nodes
charles01 to charles15. Nodes 1 and 2 have one NVIDIA Tesla K40m and one GeForce GTX TITAN X installed, nodes 3-10 have two NVIDIA Tesla K40ms installed, and nodes 13 and 14 currently have a single GeForce TITAN X installed. All nodes have two 16-core Xeons.
Multiprocessor nodes
james01 to james21. Each node has four 16-core Opterons.
Big memory nodes
We have two nodes, anna and mary, which are similar to the james nodes but have 1TB of memory.

Software

The cluster is running the standard SL7 version of DICE. If there is any software you wish to be added, please submit an RT ticket.

Scheduling

The clusters are switching over to using Slurm as a scheduler. This will be staged, with the charles machines being done first and other nodes being added opportunistically as they become free.

Note that the nodes will be reinstalled as part of this process and that /disk/scratch is cleared between installs, so you will need to ensure you have copies of any files you want to keep.
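
If you have data on a node's scratch disk that you want to keep, copy it off before the reinstall. A minimal sketch using rsync, run on the node holding the data (the /disk/scratch/$USER source path is an assumption; adjust it to wherever your files actually live):

rsync -av /disk/scratch/$USER/ ~/scratch-backup/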

The head nodes will be upgraded and remain aliased to cdtcluster and cdtcluster1.
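
You can log in to a head node through either alias, e.g. (a sketch; the full .inf.ed.ac.uk hostname is an assumption based on the node names shown later on this page):

ssh cdtcluster.inf.ed.ac.uk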

Quick Start

You can see what partitions are available using sinfo:

[escience5]iainr: sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
Interactive    up    2:00:00      2   idle landonia[01,25]
Standard*      up    8:00:00      2    mix landonia[04,11]
Standard*      up    8:00:00     10   idle landonia[13-17,20-24]
Short          up    4:00:00      1    mix landonia18
Short          up    4:00:00      1   idle landonia02
LongJobs       up 3-08:00:00      1  drain landonia10
LongJobs       up 3-08:00:00      5    mix landonia[03-04,11,18-19]

Note that you will have to specify the partition you wish to use and, if you want to use GPUs, how many GPUs you want. By default your jobs will run in the starred (default) partition, Standard in the output above, and you will get one GPU.
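
For example, to submit a batch job to the Short partition with two GPUs (these are standard Slurm options; myjob.sh is a hypothetical script):

sbatch --partition=Short --gres=gpu:2 myjob.sh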

For interactive jobs:

[escience6]iainr: srun --pty bash
[charles17]iainr: nvidia-smi
Thu Jun 14 08:45:12 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   25C    P8     9W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[charles17]iainr: exit
[escience6]iainr: 
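
The same options work for interactive sessions. A sketch requesting the Interactive partition and one GPU (standard Slurm flags; the partition name is taken from the sinfo output above):

srun --partition=Interactive --gres=gpu:1 --pty bash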

For batch jobs (assuming test2.sh is a shell script):


[escience5]iainr: ls test2.sh
test2.sh
[escience5]iainr: cat test2.sh
#!/bin/sh

/bin/hostname
/usr/bin/who
/usr/bin/nvidia-smi
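
You can also put the options in the script itself using #SBATCH directives, so they don't need to be repeated on every sbatch command line. A minimal sketch (the time limit and job name are illustrative):

#!/bin/sh
#SBATCH --partition=Standard    # which partition to run in
#SBATCH --gres=gpu:2            # request two GPUs
#SBATCH --time=01:00:00         # illustrative time limit
#SBATCH --job-name=test2        # illustrative job name

/bin/hostname
/usr/bin/nvidia-smi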

Submit the job using sbatch, requesting 2 GPUs as requestable resources. In this case we run squeue on the same line, immediately after the sbatch command, because the job will have been scheduled and run before we could type the squeue command separately.

[escience5]iainr: sbatch  --gres=gpu:2 test2.sh ; squeue
Submitted batch job 127096
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            126488  Standard   run.sh s1302760 PD       0:00      1 (PartitionTimeLimit)
            127096  Standard test2.sh    iainr PD       0:00      1 (None)
            126716  LongJobs   run.sh s1302760  R 1-19:24:23      1 landonia04
            126745  LongJobs   run.sh s1302760  R 1-18:36:28      1 landonia03
            126751  LongJobs   run.sh s1302760  R 1-18:14:14      1 landonia03
            126769  LongJobs   run.sh s1302760  R 1-17:44:40      1 landonia18
            126770  LongJobs   run.sh s1302760  R 1-17:44:08      1 landonia18
            126851  LongJobs run_ted_ s1723861  R 1-12:23:27      1 landonia04
            126852  LongJobs run_ted_ s1723861  R 1-12:21:41      1 landonia04
            126998  LongJobs CNN_BALD s1718004  R   16:03:31      1 landonia11
            127000  LongJobs CNN_BALD s1718004  R   16:02:59      1 landonia03
            127002  LongJobs CNN_BALD s1718004  R   16:02:53      1 landonia04
            127026  LongJobs CNN_Kcen s1718004  R   11:59:02      1 landonia18
            127027  LongJobs CNN_Kcen s1718004  R   11:58:55      1 landonia18
            127028  LongJobs CNN_Kcen s1718004  R   11:58:51      1 landonia18
            127030  LongJobs run-epoc s1739461  R   11:46:35      1 landonia11
            127039  LongJobs run-epoc s1739461  R   11:16:33      1 landonia03
            127054  LongJobs run-epoc s1739461  R   10:46:32      1 landonia03
            127087  LongJobs run-epoc s1739461  R    8:56:28      1 landonia18
            127088  LongJobs run-epoc s1739461  R    8:46:27      1 landonia19
            127089  LongJobs run-epoc s1739461  R    8:36:26      1 landonia03
            127090  LongJobs run-epoc s1739461  R    8:26:26      1 landonia19
            127091  LongJobs run-epoc s1739461  R    8:16:25      1 landonia19
            127092  LongJobs run-epoc s1739461  R    8:06:25      1 landonia03
            127093  LongJobs run-epoc s1739461  R    8:06:25      1 landonia19
            127094  LongJobs run-epoc s1739461  R    7:46:24      1 landonia19
[escience5]iainr: ls *.out
slurm-127096.out
[escience5]iainr: cat slurm-127096.out
landonia05.inf.ed.ac.uk
Fri Jun 15 09:38:10 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:04:00.0 Off |                  N/A |
| 24%   33C    P0    26W / 120W |      0MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:09:00.0 Off |                  N/A |
| 24%   33C    P0    28W / 120W |      0MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[escience5]iainr: 
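
Once a job is queued or running, you can list just your own jobs and cancel one by job ID (standard Slurm commands; 127096 is the example job ID from above):

squeue -u $USER         # show only your jobs
scancel 127096          # cancel a job by its job ID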

Last reviewed: 13/09/2017
