GPU batch jobs

Aim: Provide the basics of how to access and use the Stoomboot cluster GPU batch system, i.e. how to submit batch jobs to one of the GPU queues.

Target audience: Users of the Stoomboot cluster's GPUs.

Introduction

The Stoomboot cluster has a number of GPU nodes that are suitable for running certain types of algorithms. Both interactive GPU nodes and batch nodes are available.

Currently, a GPU can be used by only one user at a time. This means that users of the interactive GPU nodes need more discipline to share this resource than on the generic interactive nodes. (Contact stbc-admin@nikhef.nl for coordination.)

The GPU batch nodes can be used through the GPU queues; please use these queues only for jobs that actually need a GPU.

Types of GPU nodes

There are two main GPU manufacturers, and software will typically only work with one brand or the other.

NVIDIA-branded GPUs use the CUDA software stack.

AMD-branded GPUs use the ROCm software stack.

Prerequisites

  • A Nikhef account;
  • An ssh client.

Usage

Submitting GPU batch system jobs

A number of nodes are equipped with GPUs:

Node name           Number of nodes  Node manufacturer  Node type name     GPU manufacturer  GPU type              GPUs per node
wn-lot-{002..007}   6                Lenovo             ThinkSystem SR655  AMD               Radeon Instinct MI50  2
wn-lot-{008,009}    2                Lenovo             ThinkSystem SR655  NVIDIA            Tesla V100            2

To direct your jobs to a particular type of node, set the job requirements to match the node's properties. Since the cluster has both AMD and NVIDIA GPUs, and a job or software environment will generally not work out of the box on both, it makes sense to target either the AMD MI50 GPUs or the NVIDIA V100 GPUs. To do so, add the following two lines to your submit file (replace "V100" with "gfx906" to target the AMD MI50 GPUs):

request_gpus = 1
requirements = regexp("V100", TARGET.GPUs_DeviceName)
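Putting this together, a minimal submit file could look like the sketch below. The executable and file names (run_gpu_job.sh, gpu_job.*) are placeholders for illustration, not cluster conventions:

```
# gpu_job.sub -- minimal sketch; run_gpu_job.sh and the output file names are placeholders
executable   = run_gpu_job.sh
request_gpus = 1
requirements = regexp("V100", TARGET.GPUs_DeviceName)
output       = gpu_job.out
error        = gpu_job.err
log          = gpu_job.log
queue
```

Submit it with condor_submit gpu_job.sub.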

For more information see: https://batchdocs.web.cern.ch/gpu/index.html
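The regexp() ClassAd function performs an unanchored regular-expression match against the GPUs_DeviceName attribute, so any device name containing the pattern matches. As a rough illustration in Python (the device-name strings below are assumptions based on the table above, not values read from the cluster):

```python
import re

# Hypothetical GPUs_DeviceName values; the exact strings on the cluster may differ.
nvidia_name = "Tesla V100-PCIE-32GB"
amd_name = "Radeon Instinct MI50 (gfx906)"

# regexp("V100", ...) matches any device name containing "V100".
print(bool(re.search("V100", nvidia_name)))   # matches the NVIDIA nodes
print(bool(re.search("V100", amd_name)))      # does not match the AMD nodes
print(bool(re.search("gfx906", amd_name)))    # matches the AMD nodes
```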

Viewing the status of the GPU batch nodes

To view the status of the nodes equipped with GPUs, a similar approach can be used. Note the use of single and double quotes, and that the TARGET. prefix is dropped compared with what you added to the requirements:

$> condor_status -const 'regexp("V100", GPUs_DeviceName)'
Name                       OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@wn-lot-008.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 220736  8+17:35:28
slot1@wn-lot-009.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 241216 37+19:26:13

               Total Owner Claimed Unclaimed Matched Preempting  Drain Backfill BkIdle

  X86_64/LINUX     2     0       0         2       0          0      0        0      0

         Total     2     0       0         2       0          0      0        0      0
$> condor_status -const 'regexp("gfx906", GPUs_DeviceName)'
Name                       OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@wn-lot-002.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 140855  7+23:10:52
slot1@wn-lot-003.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 233015  8+00:02:08
slot1@wn-lot-004.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 233015  8+17:35:10
slot1@wn-lot-005.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 235063  1+00:20:50
slot1@wn-lot-006.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 233015  1+00:21:04
slot1@wn-lot-007.nikhef.nl LINUX      X86_64 Unclaimed Idle      0.000 175671  3+02:55:05

               Total Owner Claimed Unclaimed Matched Preempting  Drain Backfill BkIdle

  X86_64/LINUX     6     0       0         6       0          0      0        0      0

         Total     6     0       0         6       0          0      0        0      0

Tensorflow and PyTorch for AMD GPUs

To make using the AMD GPUs in the cluster as easy as possible, we provide two container images: one with PyTorch on Alma9 and one with TensorFlow on Alma9. The PyTorch image can be found at /cvmfs/unpacked.nikhef.nl/ndpf/rocm-pytorch/ and the TensorFlow image at /cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/. We will keep them up to date as newer versions of PyTorch and TensorFlow are released. If any specific software required for your workload is missing from these containers and cannot be provided via another existing method, let us know so it can be added to the images.

For example, for TensorFlow add the following to the condor submit file:

request_GPUs = 1
requirements = regexp("gfx906", TARGET.GPUs_DeviceName)
+SingularityImage = "/cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/"

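For a complete job, the same submit-file pattern as above applies, with the +SingularityImage line added. A sketch, where run_training.sh and the output file names are placeholders:

```
# tf_gpu_job.sub -- sketch; run_training.sh and the output file names are placeholders
executable        = run_training.sh
request_GPUs      = 1
requirements      = regexp("gfx906", TARGET.GPUs_DeviceName)
+SingularityImage = "/cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/"
output            = tf_job.out
error             = tf_job.err
log               = tf_job.log
queue
```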
If for some reason you need to install TensorFlow or PyTorch in an existing environment, you can run /project/datagrid/amd_gpu/install_tensorflow.sh to install the AMD build of TensorFlow, as long as pip is available.

If you require any other specific GPU runtime, feel free to contact us so we can provide an image for it as well.

Storing output for GPU batch jobs

Storing output data from a GPU job should follow the same conventions as for CPU batch jobs. See "Storage output data" on the Batch jobs page.

Contact