# GPU Cluster

The GPU compute cluster currently offers several hybrid nodes with a balanced CPU/RAM-to-GPU ratio. Access may be granted upon request via additional user permissions. The cluster uses the Slurm workload manager. You may start your jobs via the login node *login.gpu.cit-ec.net*. For compute tasks, using the Slurm scheduler is mandatory.

Although the cluster nodes run the TechFak netboot installation, the locations `/homes` and `/vol` are not available. Instead, dedicated homes and vol directories are located at `/media/compute/`. On initial login, your compute home directory is provisioned under `/media/compute/homes/user`. It is separate from your regular TechFak home location `/homes/user`. The compute home is accessible via [files.techfak.de](https://www.techfak.net/dienste/remote/files) and via a regular TechFak netboot system such as *compute.techfak.de*.
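
Since the compute home is separate from the regular TechFak home, data has to be copied over explicitly. A minimal sketch, run from a regular TechFak netboot machine such as *compute.techfak.de* (the `dataset` directory is a hypothetical example):

```shell
# Copy a dataset from the regular TechFak home into the compute home.
# /media/compute/homes/$USER exists only after your first login to the cluster.
cp -r "/homes/$USER/dataset" "/media/compute/homes/$USER/"
```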
## Support Channel

On the university's Matrix service, join the channel `#citec-gpu:uni-bielefeld.de`.
## Slurm Basics

Slurm jobs are scheduled via the Slurm client tools. For a brief introduction, check out the [Slurm Quickstart](https://slurm.schedmd.com/quickstart.html) guide.

There are several ways to schedule a task on a Slurm-controlled system.

The main paradigm for job scheduling on a Slurm cluster is `sbatch`. It schedules a job and requests the resources claimed by the user.

The cluster provides two types of resources, CPUs and GPUs, which can be requested for jobs in variable amounts.

The GPUs in the cluster come in two flavours: the GPU types *tesla* and *gtx*.

You may request a single GPU via the option `--gres=gpu:1`. The Slurm scheduler reserves one GPU exclusively for your job and therefore schedules jobs according to the free resources.
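
For instance, an interactive test claiming one GPU could look like this (assuming the *gpu* partition and that `nvidia-smi` is available on the nodes):

```shell
# Reserve one GPU on the gpu partition and show which device was allocated.
srun --partition=gpu --gres=gpu:1 nvidia-smi
```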

CPUs are requested with the `-c` or `--cpus-per-task=` option. For further information, have a look at the man pages of `srun` and `sbatch`. Reading the [Slurm documentation](https://slurm.schedmd.com/documentation.html) is also highly recommended.

The commands `sinfo` and `squeue` provide detailed information about the cluster's state and the jobs running.
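
Typical invocations:

```shell
sinfo              # partitions, node states and availability
squeue -u "$USER"  # your own pending and running jobs
```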
### CPU management and CPU-only jobs

Though the facility is called a GPU cluster, it is also appropriate for CPU-only computing, as it provides not only 12 GPUs but also 240 CPU cores. Effective utilization of the CPU resources can be tricky, so you should make yourself familiar with [CPU management](https://slurm.schedmd.com/cpu_management.html).
### Choosing the appropriate partition

The cluster offers two partitions. Partitions can be considered separate queues with slightly different features.

Partition selection is done with the parameter `-p` or `--partition=` in your `srun` commands and `sbatch` scripts. The default partition is *cpu*; jobs that are not mapped to a partition will be started there.

We have the *cpu* and the *gpu* partition. If you have a CPU-only job (not requesting any GPU resources with `--gres=gpu:n`), you should start it on the *cpu* partition.

A job using a GPU should be started on the *gpu* partition, with one exception: jobs which request one GPU (with `--gres=gpu:1`) and more than 2 CPUs (with the `-c` or `--cpus-per-task` option) should use the *cpu* partition.
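
Put together, partition choice for the common cases might look like this (the script names are hypothetical placeholders):

```shell
# CPU-only job -> cpu partition (the default)
sbatch -p cpu -c 16 cpu_job.sbatch

# GPU job -> gpu partition
sbatch -p gpu --gres=gpu:1 gpu_job.sbatch

# Exception: one GPU plus more than 2 CPUs -> cpu partition
sbatch -p cpu --gres=gpu:1 -c 4 mixed_job.sbatch
```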

The reason for this policy is not obvious and is explained under *GPU Blocking*.

The example `example.job.sbatch` requests one GTX 1080 Ti for the job and calls the payload `example.job.sh` via `srun`.

File: `example.job.sbatch`

```bash
#!/bin/bash
# Request one gtx-type GPU (GTX 1080 Ti); the gres type name "gtx" is
# assumed from the cluster's GPU flavours (tesla, gtx).
#SBATCH --partition=gpu
#SBATCH --gres=gpu:gtx:1

srun example.job.sh
```

File: `example.job.sh`

```bash
#!/bin/bash
# Payload executed inside the allocation; nvidia-smi is only an example,
# replace it with your actual workload.
nvidia-smi
```