Ganymede cluster

Aim: Provide basic information on the Nikhef Ganymede HTCondor cluster.

Target audience: Data analysts in gravitational waves.

Introduction

The Ganymede cluster is an HTCondor-based batch system that can be used for data analysis.

The cluster runs HTCondor 9.8 on CentOS 7 and has around 1600 AMD EPYC cores. The dCache, home directory, /project and /data storage systems are all accessible from every worker node (also called execute node or execute point) and from the submit host (visar.nikhef.nl).

This cluster is mainly reserved for Nikhef's Gravitational Waves group.

Usage

Access and Use

The Ganymede cluster is accessed from the "submit node", visar.nikhef.nl. From this node you can run condor_submit to submit batch jobs, condor_q -allusers to see which jobs are currently queued or running on the system, and condor_status to show which resources are available in the cluster. These commands offer many more options than are shown here; append -help to any of them for a full list of what can be queried.
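As an illustration of the workflow described above, the following is a minimal sketch of a first test job. The file name hello.sub, the output file names and the resource requests are assumptions chosen for the example, not Ganymede-specific requirements; the cluster may need additional settings that are not covered here.

```shell
# Write a minimal submit description file (hypothetical file and job names).
cat > hello.sub <<'EOF'
universe       = vanilla
executable     = /bin/echo
arguments      = "Hello from Ganymede"
output         = hello.$(ClusterId).$(ProcId).out
error          = hello.$(ClusterId).$(ProcId).err
log            = hello.$(ClusterId).log
request_cpus   = 1
request_memory = 1GB
queue 1
EOF

# Submit only where HTCondor is installed (e.g. on the submit node).
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit hello.sub   # submit the job
    condor_q                  # check that it shows up in your queue
else
    echo "condor_submit not found; run this on the submit node" >&2
fi
```

Once the job has finished, its standard output should appear in the hello.*.out file in the directory from which it was submitted.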

Commands to monitor or manage your DAGs or runs

condor_q                 # Monitor your DAGs/jobs
condor_q -allusers       # Same, for all users

condor_q -analyze 846.0  # Details on job 846
condor_q -better 846.0   # More details on job 846 - same as -better-analyze

condor_q -unmatchable    # List with some details on jobs that do not match any machines/nodes

condor_rm <job_id>       # Remove job or DAG <job_id>
condor_rm <user>         # Remove all <user>'s jobs

Note: when a DAG is removed, condor_q may show it split up into a list of individual jobs for a few seconds before they disappear.

Commands to monitor the status of the cluster, modules, nodes or cores

condor_status         # What is each node doing?  (Claimed ~ Busy;  Unclaimed ~ Idle)
condor_status -total  # Just show totals  (see my script condor_status_totals)
condor_status -long   # Detailed info on ALL cpus

## Claimed/free cores:
condor_status -total -af:h  Name Cpus State   | awk '/Unclaimed/ {unc += $2} ; /Claimed/ {cla += $2}; END { print "Total cores " unc+cla ", Claimed " cla " and Unclaimed " unc}'

## Number of active jobs, the cores they're using and their load:
condor_status -const 'Activity == "Busy"' -af CPUsUsage Cpus | awk '{usage += $1; total += $2; jobs += 1}END{print jobs " active jobs are assigned " total " cores and are using " usage " of them (" (usage / total)*100 "%)" }'

## Modules (varying number with varying number of cores):
condor_status -total -af:h  Name Cpus State   | head
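The claimed/free core count above can also be wrapped in a small shell function, which makes the aggregation easier to read and to try out away from the cluster. This is only a sketch: the function name summarise_cores is hypothetical, the sample lines are made up for illustration, and it matches the State column exactly rather than searching the whole line.

```shell
# Sum cores per state from "Name Cpus State" lines read on stdin.
summarise_cores() {
    awk '$3 == "Unclaimed" {unc += $2}
         $3 == "Claimed"   {cla += $2}
         END {printf "Total cores %d, Claimed %d and Unclaimed %d\n", unc + cla, cla, unc}'
}

# Made-up sample input, for illustration only.
printf '%s\n' \
    'slot1_1@node-a 8 Claimed' \
    'slot1_2@node-a 4 Claimed' \
    'slot1@node-b 16 Unclaimed' | summarise_cores
# prints: Total cores 28, Claimed 12 and Unclaimed 16

# On the submit node, feed it live data instead:
# condor_status -af Name Cpus State | summarise_cores
```

Comparing the State field directly means a node name that happens to contain the word "Claimed" cannot affect the totals.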

Submitting HTCondor jobs with containers

It is possible to submit jobs that use Apptainer containers (previously known as Singularity) by adding

universe                = container
container_image         = ./image.sif

to the job submission parameters in the .sub file.
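For context, a complete submit file using these two lines might look like the sketch below. Everything except the universe and container_image lines is an assumption made for the example: the file name container.sub, the choice of /bin/cat as the program to run inside the image, the output file names and the resource requests.

```shell
# Write a container-universe submit file (hypothetical names and requests).
cat > container.sub <<'EOF'
universe                = container
container_image         = ./image.sif

# Run a program that already exists inside the image, so do not transfer it.
executable              = /bin/cat
arguments               = /etc/os-release
transfer_executable     = false

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = container.$(ClusterId).out
error                   = container.$(ClusterId).err
log                     = container.$(ClusterId).log

request_cpus            = 1
request_memory          = 1GB
queue
EOF

# Then, on the submit node:
# condor_submit container.sub
```

If the job runs inside the container, the output file should describe the operating system of the image rather than that of the worker node.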

Cluster activity

[Add graphs similar to the STBC cluster when they are available.]

Links

The Ganymede entry in Nikhef's Gravitational Waves wiki.
Everything you want to know about HTCondor: https://htcondor.readthedocs.io/en/latest/.

Contact

Email grid.sysadmin@nikhef.nl for questions about the Ganymede cluster.