Interactive GPU nodes
Aim: Describe how to access and use the interactive GPU nodes on the Stoomboot cluster.
Target audience: Users of the Stoomboot cluster and its GPUs.
Introduction
There are several generations and brands of GPUs in our data center. Depending on the software, you may see different performance figures on these nodes, and you may need to tweak and/or recompile your sources for different GPU types.
To make things easier, an interactive node is available for each GPU type. These nodes can be used for compilation and tests as well as for computing; to scale up your computing, use the GPU nodes in the batch system instead.
Please keep GPU consumption and testing time to a minimum, and run your real jobs on the batch system.
These machines are intended for the following purposes:
- Running interactive jobs, like analysis work (making plots etc).
- Testing GPU jobs with short runtimes.
- Interaction with the batch system (see below for the relevant commands).
Usage
Access and Use
The following interactive GPU nodes are available via ssh for interactive use and testing:
Node name | Node manufacturer | Node type name | GPU manufacturer | GPU type | Number of GPUs
---|---|---|---|---|---
wn-lot-001 | Lenovo | ThinkSystem SR655 | AMD, NVIDIA | Radeon Instinct MI50, Quadro GV100 | 2
To access the nodes from home via ssh, use eduVPN or log in through login.nikhef.nl.
Libraries (CUDA)
The GPU drivers and the following versions of the NVIDIA CUDA libraries are installed:
- 12.5
The relevant version of the CUDA Deep Neural Network (cuDNN) library is also installed.
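To check which driver and CUDA toolkit versions a given node actually provides before building against them, the standard NVIDIA tools can be queried directly on the node:

> nvidia-smi
> nvcc --version

nvidia-smi reports the driver version and the GPUs it can see; nvcc --version reports the installed CUDA toolkit version (12.5 above).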
Python + GPU
To use Python software in an environment that supports the GPUs, we recommend using conda to create a virtual environment, activate it, and install the software you need.
Conda virtual environment
Create and activate a new virtual environment using:
> conda create --prefix /data/your_project/your_username/gpu_venv python=3.9
> conda activate /data/your_project/your_username/gpu_venv
Installing Python packages inside the virtualenv
To install additional software inside the virtualenv, first activate it, then use conda to install the packages.
Sometimes different builds of a package are available, e.g. for different Python versions, for different CUDA versions, or CPU-only. List the candidates and select one by specifying the exact build:
> conda search tensorflow
tensorflow 2.11.0 cpu_py310hd1aba9c_0 conda-forge
tensorflow 2.11.0 cpu_py38h66f0ec1_0 conda-forge
tensorflow 2.11.0 cpu_py39h4655687_0 conda-forge
tensorflow 2.11.0 cuda112py310he87a039_0 conda-forge
tensorflow 2.11.0 cuda112py38hded6998_0 conda-forge
tensorflow 2.11.0 cuda112py39h01bd6f0_0 conda-forge
> conda install tensorflow=2.11.0=cuda112py39h01bd6f0_0
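After installing, conda list can confirm which version and build actually ended up in the activated environment:

> conda list tensorflow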
Using the software
Once things are installed, they can be used directly:
> python
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-02-23 14:47:18.909239: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
AMD GPUs
A number of AMD GPUs are available, either installed in an interactive node or as part of the gpu-amd Stoomboot queue.
Software with GPU support often has to be installed with that support explicitly enabled, or comes in builds provided by AMD itself; TensorFlow and PyTorch are examples.
Container images
As described on the GPU batch jobs page, container images are available that provide either TensorFlow or PyTorch. These images can also be used to run these tools on an interactive node:
> apptainer shell --rocm /cvmfs/unpacked.nikhef.nl/ndpf/rocm-tensorflow/
Apptainer> python3
Python 3.9.18 (main, Jan 4 2024, 00:00:00)
[GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-08-15 15:13:01.081599: E external/local_xla/xla/stream_executor/plugin_registry.cc:93] Invalid plugin kind specified: DNN
2024-08-15 15:13:01.200940: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> print(tf.version.VERSION)
2.15.0
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Conda environment
To use Tensorflow and PyTorch for AMD GPUs from a conda environment, create a new environment and then use pip
to install the packages.
The available versions of tensorflow-rocm can be found on its PyPI page, https://pypi.org/project/tensorflow-rocm. It can be installed with pip in an environment with a supported Python version.
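As a sketch, after creating and activating a conda environment as above (the prefix path is the same placeholder as before), the PyPI package can be installed with:

> pip install tensorflow-rocm

If the latest release does not support your Python or ROCm version, pin an explicit version from the PyPI page instead, e.g. pip install tensorflow-rocm==<version>.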
For PyTorch a special download site is needed. The exact command can be found by visiting https://pytorch.org/get-started/locally and selecting "Linux", "Pip", "Python" and "ROCm 6.1" in the widget, which then shows the install command.
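At the time of writing the widget shows a command along the following lines; treat this as illustrative and copy the exact command from the site, since the index URL changes with the ROCm version:

> pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1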
Contact
- Email stbc-users@nikhef.nl for questions about GPUs.
- Chat in Nikhef's Mattermost channel for stbc-users.