
GPU batch jobs

Aim: Provide the basics of how to access and use the Stoomboot cluster GPU batch system, i.e. how to submit batch jobs to one of the GPU queues.

Target audience: Users of the Stoomboot cluster's GPUs.

Introduction

The Stoomboot cluster has a number of GPU nodes that are suitable for running certain types of algorithms. Both interactive GPU nodes and batch nodes are available.

Currently, each GPU can be used by only one user at a time. The interactive GPU nodes therefore require more discipline from users in sharing this resource than the generic interactive nodes do. (Contact stbc-admin@nikhef.nl for coordination.)

The GPU batch nodes can be used through the GPU queues. Please use these queues only for jobs that actually need a GPU!

Types of GPU nodes

There are two main GPU manufacturers, and GPU software will typically only work on one brand or the other.

NVIDIA-branded GPUs use a software stack called CUDA.

AMD-branded GPUs use a software stack called ROCm.
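
If you are unsure which GPU a node offers, you can check from a shell on that node. A minimal sketch, assuming the vendor command-line tools are installed there:

# on an NVIDIA node: list GPUs, driver and CUDA versions
nvidia-smi

# on an AMD node: list GPUs and their ROCm driver status
rocm-smi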

Prerequisites

  • A Nikhef account;
  • An ssh client.

Usage

Submitting GPU batch system jobs

To direct your jobs to a particular type of node, you can set the job requirements to match the properties of that node; see the table and the sketch below.

While we update our documentation following the recent move to HTCondor, the CERN batch documentation page below explains how to do this:

https://batchdocs.web.cern.ch/gpu/index.html

Node name          Number of nodes  Node manufacturer  Node type name     GPU manufacturer  GPU type              GPUs per node
wn-lot-{002..007}  6                Lenovo             ThinkSystem SR655  AMD               Radeon Instinct MI50  2
wn-lot-{008,009}   2                Lenovo             ThinkSystem SR655  NVIDIA            Tesla V100            2
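
For example, to steer a job onto the NVIDIA nodes from the table above, you can combine a GPU request with a requirements expression on the machine name. This is a minimal sketch, assuming the worker hostnames contain the wn-lot names listed above; check which attributes your pool actually advertises with condor_status -l <node>.

# request one GPU and match only the NVIDIA nodes (wn-lot-008/009)
request_GPUs = 1
requirements = regexp("wn-lot-00[89]", Machine)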

To make using the GPUs in the cluster as easy as possible, we provide two container images: one with PyTorch on AlmaLinux 9 and one with TensorFlow on AlmaLinux 9. The PyTorch image can be found at /cvmfs/unpacked.nikhef.nl/ndpf/rocm-pytorch/ and the TensorFlow image at /cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/. We will keep them up to date as newer versions of PyTorch and TensorFlow are released. If any specific software that your workload requires is missing from these containers and cannot be provided via another existing method, let us know and we will add it to the images.

For example, for TensorFlow, add the following to the HTCondor submit file:

request_GPUs = 1
+SingularityImage = "/cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/"
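
A complete submit file could then look like the minimal sketch below; the script name, file names, and resource amounts are placeholders to adapt to your workload.

# run_training.sh is a placeholder for your own job script
executable        = run_training.sh
request_GPUs      = 1
request_cpus      = 1
request_memory    = 4GB
+SingularityImage = "/cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/"
log               = job.log
output            = job.out
error             = job.err
queue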

If for some reason you need to install TensorFlow or PyTorch in an existing environment, then, as long as pip is available, you can run /project/datagrid/amd_gpu/install_tensorflow.sh to install the AMD build of TensorFlow.
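
For instance, to install it into a fresh virtual environment (the environment path here is just an illustration):

# create and activate a clean environment, then run the installer
python3 -m venv ~/amd-tf-env
source ~/amd-tf-env/bin/activate
/project/datagrid/amd_gpu/install_tensorflow.sh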

If there are any specific GPU runtimes that you require, feel free to contact us so we can provide an image for them.

Storing output for GPU batch jobs

Storing output data from a GPU job should follow the same conventions as for CPU batch jobs. See "Storage output data" on the Batch jobs page.

Contact