Interactive GPU nodes
Aim: Describe how to access and use the interactive GPU nodes on the Stoomboot cluster.
Target audience: Users of the Stoomboot cluster and its GPUs.
Introduction
There are several generations and brands of GPUs in our data center. Depending on the software, you may see different performance figures on these nodes, and you may need to tweak and/or recompile your sources for different GPU types.
To make things easier, an interactive node is available for each GPU type. These nodes can be used for compilation and tests as well as for computing; to scale up your computing, use the GPU nodes in the batch system instead.
Please keep GPU consumption and testing time to a minimum, and run your real jobs on the batch system.
These machines are intended for the following purposes:
- Running interactive jobs, like analysis work (making plots etc).
- Testing GPU jobs with short runtimes.
- Interaction with the batch system (see below for the relevant commands).
Usage
Access and Use
The following interactive GPU nodes are available via ssh for interactive use and testing:
Node name | Node manufacturer | Node type name | GPU manufacturer | GPU type | Number of GPUs
---|---|---|---|---|---
wn-lot-001 | Lenovo | ThinkSystem SR655 | AMD, NVIDIA | Radeon Instinct MI50, Quadro GV100 | 2
To access the nodes from home via ssh, use eduVPN or log in through login.nikhef.nl.
Libraries (CUDA)
The GPU drivers and the following versions of the NVIDIA CUDA libraries are installed:
- 12.5
The relevant version of the CUDA Deep Neural Network (cuDNN) library is also installed.
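To check which driver and CUDA toolkit versions a given node actually provides before building against them, the standard NVIDIA tools can be queried directly on the node:

> nvidia-smi
> nvcc --version

nvidia-smi reports the driver version and the GPUs it can see; nvcc --version reports the installed CUDA toolkit version (12.5 above).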
Python + GPU
To use Python software in an environment that supports the GPUs, we recommend using conda to create a virtual environment, activate it, and install the software you need.
Conda virtual environment
Create and activate a new virtual environment using:
> conda create --prefix /data/your_project/your_username/gpu_venv python=3.9
> conda activate /data/your_project/your_username/gpu_venv
Installing Python packages inside the virtualenv
To install additional software inside the virtualenv, first activate it, then use conda to install the packages.
Sometimes different builds of a package are available, e.g. for different Python versions, for different CUDA versions, or CPU-only. List the candidates and select one by specifying the exact build:
> conda search tensorflow
tensorflow 2.11.0 cpu_py310hd1aba9c_0 conda-forge
tensorflow 2.11.0 cpu_py38h66f0ec1_0 conda-forge
tensorflow 2.11.0 cpu_py39h4655687_0 conda-forge
tensorflow 2.11.0 cuda112py310he87a039_0 conda-forge
tensorflow 2.11.0 cuda112py38hded6998_0 conda-forge
tensorflow 2.11.0 cuda112py39h01bd6f0_0 conda-forge
> conda install tensorflow=2.11.0=cuda112py39h01bd6f0_0
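After installing, conda list can confirm which version and build actually ended up in the activated environment:

> conda list tensorflow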
Using the software
Once things are installed, they can be used directly:
> python
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-02-23 14:47:18.909239: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
AMD GPUs
A number of AMD GPUs are available, either installed in an interactive node or as part of the gpu-amd Stoomboot queue.
Software with GPU support often has to be installed with that support explicitly enabled, or comes in builds provided by AMD itself; TensorFlow and PyTorch are examples.
Container images
As described on the GPU batch jobs page, container images are available that provide either TensorFlow or PyTorch. These images can also be used to run these tools on an interactive node:
> apptainer shell --rocm /cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/
Apptainer> python3
Python 3.9.18 (main, Jan 4 2024, 00:00:00)
[GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-08-15 15:13:01.081599: E external/local_xla/xla/stream_executor/plugin_registry.cc:93] Invalid plugin kind specified: DNN
2024-08-15 15:13:01.200940: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> print(tf.version.VERSION)
2.15.0
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Conda environment
To use Tensorflow and PyTorch for AMD GPUs from a conda environment, create a new environment and then use pip
to install the packages.
The available versions of tensorflow-rocm can be found on its PyPI page, https://pypi.org/project/tensorflow-rocm. It can be installed with pip in an environment with a supported Python version.
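As a sketch, after creating and activating a conda environment as above (the prefix path is the same placeholder as before), the PyPI package can be installed with:

> pip install tensorflow-rocm

If the latest release does not support your Python or ROCm version, pin an explicit version from the PyPI page instead, e.g. pip install tensorflow-rocm==<version>.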
For PyTorch a special download site is needed. The exact command can be found by visiting https://pytorch.org/get-started/locally and selecting "Linux", "Pip", "Python" and "ROCm 6.1" in the widget, which then shows the install command.
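At the time of writing the widget shows a command along the following lines; treat this as illustrative and copy the exact command from the site, since the index URL changes with the ROCm version:

> pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1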
Contact
- Email stbc-users@nikhef.nl for questions about GPUs.
- Chat in Nikhef's Mattermost channel for stbc-users.