GPU batch jobs

Aim: Provide the basics of how to access and use the Stoomboot cluster GPU batch system, i.e. how to submit batch jobs to one of the GPU queues.

Target audience: Users of the Stoomboot cluster's GPUs.

Introduction

The Stoomboot cluster has a number of GPU nodes that are suitable for running certain types of algorithms. Both interactive GPU nodes and batch nodes are available.

Currently, a GPU can be used by only one user at a time. This means that users of the interactive GPU nodes need more discipline to share this resource than on the generic interactive nodes. (Contact stbc-admin@nikhef.nl for coordination.)

The GPU batch nodes can be used through the GPU queues; please use these queues only for jobs that actually need a GPU.

Types of GPU nodes

There are two main GPU manufacturers, and software will typically only work with one brand or the other.

NVIDIA-branded GPUs use the CUDA software stack.

AMD-branded GPUs use the ROCm software stack.

Prerequisites

  • A Nikhef account;
  • An ssh client.

Usage

Submitting GPU batch system jobs

A number of nodes are equipped with GPUs:

Node name           Number of nodes  Node manufacturer  Node type name     GPU manufacturer  GPU type              GPUs per node
wn-lot-{002..007}   6                Lenovo             ThinkSystem SR655  AMD               Radeon Instinct MI50  2
wn-lot-{008,009}    2                Lenovo             ThinkSystem SR655  NVIDIA            Tesla V100            2

To direct your jobs to a particular type of node, set the job requirements to match the node's properties. Since the cluster has both AMD and NVIDIA GPUs, and a job or software environment will generally not work out of the box on both, it makes sense to target either the AMD MI50 GPUs or the NVIDIA V100 GPUs. To do so, add the following two lines to your submit file (replace "V100" with "gfx906" to target the AMD MI50 GPUs):

request_gpus = 1
requirements = regexp("V100", TARGET.GPUs_DeviceName)
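Putting this together, a minimal submit file could look like the sketch below. The executable and file names (run_gpu_job.sh, gpu_job.*) are placeholders for illustration, not cluster conventions:

```
# gpu_job.sub -- minimal sketch; run_gpu_job.sh and the output file names are placeholders
executable   = run_gpu_job.sh
request_gpus = 1
requirements = regexp("V100", TARGET.GPUs_DeviceName)
output       = gpu_job.out
error        = gpu_job.err
log          = gpu_job.log
queue
```

Submit it with condor_submit gpu_job.sub.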

For more information see: https://batchdocs.web.cern.ch/gpu/index.html
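The regexp() ClassAd function performs an unanchored regular-expression match against the GPUs_DeviceName attribute, so any device name containing the pattern matches. As a rough illustration in Python (the device-name strings below are assumptions based on the table above, not values read from the cluster):

```python
import re

# Hypothetical GPUs_DeviceName values; the exact strings on the cluster may differ.
nvidia_name = "Tesla V100-PCIE-32GB"
amd_name = "Radeon Instinct MI50 (gfx906)"

# regexp("V100", ...) matches any device name containing "V100".
print(bool(re.search("V100", nvidia_name)))   # matches the NVIDIA nodes
print(bool(re.search("V100", amd_name)))      # does not match the AMD nodes
print(bool(re.search("gfx906", amd_name)))    # matches the AMD nodes
```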

Viewing the status of the GPU batch nodes

To view the status of the nodes equipped with GPUs, a similar approach can be used. Note the use of single and double quotes, and that the TARGET. prefix is dropped compared with what you added to the requirements:

$> condor_status -const 'regexp("V100", GPUs_DeviceName)'
Name                       OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@wn-lot-008.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 220736  8+17:35:28
slot1@wn-lot-009.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 241216 37+19:26:13

               Total Owner Claimed Unclaimed Matched Preempting  Drain Backfill BkIdle

  X86_64/LINUX     2     0       0         2       0          0      0        0      0

         Total     2     0       0         2       0          0      0        0      0
$> condor_status -const 'regexp("gfx906", GPUs_DeviceName)'
Name                       OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@wn-lot-002.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 140855  7+23:10:52
slot1@wn-lot-003.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 233015  8+00:02:08
slot1@wn-lot-004.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 233015  8+17:35:10
slot1@wn-lot-005.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 235063  1+00:20:50
slot1@wn-lot-006.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 233015  1+00:21:04
slot1@wn-lot-007.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 175671  3+02:55:05

               Total Owner Claimed Unclaimed Matched Preempting  Drain Backfill BkIdle

  X86_64/LINUX     6     0       0         6       0          0      0        0      0

         Total     6     0       0         6       0          0      0        0      0

Tensorflow and PyTorch for AMD GPUs

To make using the AMD GPUs in the cluster as easy as possible, we provide two container images: one with PyTorch on Alma9 and one with TensorFlow on Alma9. The PyTorch image can be found at /cvmfs/unpacked.nikhef.nl/ndpf/rocm-pytorch/ and the TensorFlow image at /cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/. We will keep them up to date as newer versions of PyTorch and TensorFlow are released. If any specific software required for your workload is missing from these containers and cannot be provided via another existing method, let us know so it can be added to the images.

For example, for TensorFlow add the following to the condor submit file:

request_GPUs = 1
requirements = regexp("gfx906", TARGET.GPUs_DeviceName)
+SingularityImage = "/cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/"

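For a complete job, the same submit-file pattern as above applies, with the +SingularityImage line added. A sketch, where run_training.sh and the output file names are placeholders:

```
# tf_gpu_job.sub -- sketch; run_training.sh and the output file names are placeholders
executable        = run_training.sh
request_GPUs      = 1
requirements      = regexp("gfx906", TARGET.GPUs_DeviceName)
+SingularityImage = "/cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/"
output            = tf_job.out
error             = tf_job.err
log               = tf_job.log
queue
```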
If for some reason you need to install TensorFlow or PyTorch in an existing environment, you can run /project/datagrid/amd_gpu/install_tensorflow.sh to install the AMD build of TensorFlow, as long as pip is available.

If you require any other specific GPU runtime, feel free to contact us so we can provide an image for it as well.

Storing output for GPU batch jobs

Storing output data from a GPU job should follow the same conventions as for CPU batch jobs. See "Storage output data" on the Batch jobs page.

Contact