Ganymede cluster
Aim: Provide basic information on the Nikhef Ganymede HTCondor cluster.
Target audience: Data analysts in gravitational waves.
Introduction
The Ganymede cluster is an HTCondor-based batch system that can be used for data analysis.
The cluster runs HTCondor 9.8 on CentOS 7 and has around 1600 AMD EPYC cores. The dCache, home-directory, /project and /data storage systems are all accessible from every worker node (also called execute node or execute point) and from the submit host (visar.nikhef.nl).
This cluster is mainly reserved for Nikhef's Gravitational Waves group.
Usage
Access and Use
The Ganymede cluster can be accessed from the "submit node", visar.nikhef.nl. This node allows you to run condor_submit to submit batch jobs, condor_q -all to see what jobs are currently running on the system, and condor_status to show information about what is available in the cluster. These commands offer many more options than shown here; add -help to any of them for a full explanation of what can be queried.
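As an illustration of the submit step, the sketch below writes a minimal submit description file and hands it to condor_submit. The file name hello.sub, the echo payload, the output/error/log names and the resource requests are all assumptions chosen for the example, not site recommendations; adjust them to your own job.

```shell
# Write a minimal submit description file (all names here are illustrative).
cat > hello.sub <<'EOF'
# hello.sub: run /bin/echo once on a single core
executable     = /bin/echo
arguments      = "Hello from Ganymede"
output         = hello.$(ClusterId).$(ProcId).out
error          = hello.$(ClusterId).$(ProcId).err
log            = hello.$(ClusterId).log
request_cpus   = 1
request_memory = 1GB
queue 1
EOF

# Submit only when the HTCondor tools are on the PATH (i.e. on the submit node).
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit hello.sub   # queue the job
    condor_q                  # show your jobs in the queue
else
    echo "condor_submit not found; wrote hello.sub only"
fi
```

Once the job has finished, its standard output should be in the hello.<cluster>.<proc>.out file named by the output line, and the log file records the job's life cycle.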
Commands to monitor or manage your DAGs or runs
condor_q # Monitor your DAGs/jobs
condor_q -allusers # Same, for all users
condor_q -analyze 846.0 # Details on job 846
condor_q -better 846.0 # More details on job 846 - same as -better-analyze
condor_q -unmatchable # List with some details on jobs that do not match any machines/nodes
condor_rm <job_id> # Remove job or DAG <job_id>
condor_rm <user> # Remove all <user>'s jobs
Note: if a DAG is removed, it may show up in the queue as a list of individual jobs for a few seconds before disappearing.
Commands to monitor the status of the cluster, modules, nodes or cores
condor_status # What is each node doing? (Claimed ~ Busy; Unclaimed ~ Idle)
condor_status -total # Just show totals (see my script condor_status_totals)
condor_status -long # Detailed info on ALL cpus
## Claimed/free cores:
condor_status -total -af:h Name Cpus State | awk '/Unclaimed/ {unc += $2} ; /Claimed/ {cla += $2}; END { print "Total cores " unc+cla ", Claimed " cla " and Unclaimed " unc}'
## Number of active jobs, the cores they're using and their load:
condor_status -const 'Activity == "Busy"' -af CPUsUsage Cpus | awk '{usage += $1; total += $2; jobs += 1}END{print jobs " active jobs are assigned " total " cores and are using " usage " of them (" (usage / total)*100 "%)" }'
## Modules (varying number with varying number of cores):
condor_status -total -af:h Name Cpus State | head
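To see what the awk filter in the claimed/free one-liner does, it can be tried on a few lines of input without access to the cluster. The sketch below wraps a slightly stricter variant (it compares the third column exactly instead of pattern-matching the whole line) in a shell function; the function name count_cores and the sample node names and core counts are made up for illustration.

```shell
# Sum cores per state from "Name Cpus State" lines read on standard input.
count_cores() {
    awk '$3 == "Unclaimed" {unc += $2}
         $3 == "Claimed"   {cla += $2}
         END {print "Total cores " unc+cla ", Claimed " cla+0 " and Unclaimed " unc+0}'
}

# Made-up sample lines standing in for condor_status output (header included).
sample='Name Cpus State
slot1@node-a.example 8 Claimed
slot2@node-a.example 4 Unclaimed
slot1@node-b.example 16 Claimed'

result=$(printf '%s\n' "$sample" | count_cores)
echo "$result"

# On the submit node, the same function can read live data:
#   condor_status -total -af:h Name Cpus State | count_cores
```

The header line is skipped automatically because its third column is neither Claimed nor Unclaimed, and the +0 makes awk print 0 rather than an empty string when a state does not occur.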
Submitting HTCondor jobs with containers
It is possible to submit jobs that use Apptainer containers (previously known as Singularity) by adding
to the job submission parameters in the .sub file.
Cluster activity
[Add graphs similar to the STBC cluster when they are available.]
Links
- The Ganymede entry in Nikhef's Gravitational Waves wiki.
- Everything you want to know about HTCondor: https://htcondor.readthedocs.io/en/latest/.
Contact
Email grid.sysadmin@nikhef.nl for questions about the Ganymede cluster.