GPU batch jobs

Aim: Provide the basics of how to access and use the Stoomboot cluster GPU batch system, i.e. how to submit batch jobs to one of the GPU queues.

Target audience: Users of the Stoomboot cluster's GPUs.

Introduction

The Stoomboot cluster has a number of GPU nodes that are suitable for running certain types of algorithms. Both interactive GPU nodes and batch nodes are available.

Currently, each GPU can be used by only one user at a time. The interactive GPU nodes therefore require more discipline from users in sharing this resource than the generic interactive nodes do. (Contact stbc-admin@nikhef.nl for coordination.)

The GPU batch nodes can be used through the GPU queues, but use these queues only for jobs that actually need a GPU!

Types of GPU nodes

There are two main GPU manufacturers, and software will typically only work on one brand or the other, so there is a separate queue that directs jobs only to nodes with one type of GPU:

  • gpu-amd is the queue to use for jobs that run on AMD GPUs;
  • gpu-nv is the queue to use for jobs that run on NVIDIA GPUs.
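
To check that these queues are available and to see their current job counts, the standard Torque queue listing can be used:

    # List all queues on the cluster, including gpu-amd and gpu-nv
    qstat -q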

Prerequisites

  • A Nikhef account;
  • An ssh client.

Usage

Submitting GPU batch system jobs

To direct your job to a particular type of node, set the job requirements to match the node property (see the table below for the available tags). The following example directs the job to a full node with two AMD MI50 cards:

qsub -l 'nodes=1:mi50,walltime=12:00:00,mem=4gb' -q gpu-amd job.sh

This example selects the NVIDIA V100 node(s):

qsub -l 'nodes=1:v100,walltime=12:00:00,mem=4gb' -q gpu-nv job.sh

Do not specify ppn (processors per node) for GPU jobs. The job will be allocated all CPUs on the server, which is either 32 or 64 CPU cores depending on the node.
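
A GPU job script is an ordinary shell script; the scheduler places it on a GPU node and the devices are visible to whatever the script runs. The following is a minimal sketch of a job.sh for the gpu-nv queue; the program name my_gpu_program and its input/output files are placeholders, not software provided on the cluster.

    #!/bin/bash
    # Minimal example GPU job script (gpu-nv queue).
    # "my_gpu_program", input.dat and output.dat are placeholders.

    # Show which NVIDIA GPU(s) this job landed on; on the AMD nodes
    # use rocm-smi instead.
    nvidia-smi

    # Run the workload from the directory the job was submitted from.
    cd "$PBS_O_WORKDIR"
    ./my_gpu_program --input input.dat --output output.dat

Submit it with one of the qsub commands above, e.g. qsub -l 'nodes=1:v100,walltime=12:00:00,mem=4gb' -q gpu-nv job.sh.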

| Node name | Number of nodes | Node manufacturer | Node type name | GPU manufacturer | GPU type | GPUs per node | qsub tags |
|---|---|---|---|---|---|---|---|
| wn-cuyp-{002..003} | 2 | Fujitsu | CELSIUS C740 | NVIDIA | GeForce GTX 1080 | 1 | nvidia, gtx1080 |
| wn-lot-{002..007} | 6 | Lenovo | ThinkSystem SR655 | AMD | Radeon Instinct MI50 | 2 | amd, mi50 |
| wn-lot-009 | 1 | Lenovo | ThinkSystem SR655 | NVIDIA | Tesla V100 | 2 | nvidia, v100 |

Storing output for GPU batch jobs

Storing output data from a GPU job follows the same conventions as for CPU batch jobs. See "Storing output data" on the Batch jobs page.
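
In practice this usually means writing output to node-local scratch while the job runs and copying it to shared storage at the end. A minimal sketch, assuming $TMPDIR points to node-local scratch and /data/<your-group>/<your-username> is your group's storage area (both are assumptions to adapt to your own setup):

    # Run in node-local scratch to avoid heavy I/O on shared storage...
    cd "$TMPDIR"
    "$PBS_O_WORKDIR"/my_gpu_program --output result.dat

    # ...then copy the result to shared storage before the job ends.
    cp result.dat /data/<your-group>/<your-username>/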

Contact
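
For questions about, or coordination of, the GPU nodes and queues, contact stbc-admin@nikhef.nl.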