
CPU batch jobs

Aim: Provide the basics of how to access and use the Stoomboot cluster CPU batch system, i.e. how to submit, monitor and delete batch jobs.

Prerequisites

  • A Nikhef account
  • An ssh client (usually pre-installed on Linux and Mac OS X, PuTTY or Visual Studio Code on Windows)

Introduction

Batch system jobs are used to scale out analysis work—from running a program once or twice to running a program/script tens (or hundreds) of times—to get a solution or results faster.

The Stoomboot cluster uses the HTCondor software scheduler to distribute jobs across more than 5k cores. The dCache, home directories, /project and /data storage systems are all accessible from any of the worker nodes (or execute nodes/points) and the interactive Stoomboot servers.

Some points to note about the batch system:

  1. All jobs run in a container. If the job submission script does not specify a container, the job(s) will run in an Alma Linux 9 container.
  2. Jobs are limited to a maximum runtime of 4 days; the exact limit depends on the “JobCategory” chosen for the job.
  3. The maximum number of CPU cores per job that we guarantee to be available is 16. This can be raised if there is a need.

Accessing the batch system

To interact with the batch system (e.g. submit a job, query a job status or delete a job), you need to log in to one of the interactive Stoomboot nodes. These nodes allow you to run condor_submit to submit batch jobs, condor_q -all to see what jobs are currently running on the system, and condor_status to show information about what is available in the cluster. Note that these commands offer much more than shown here; add -help to any of them for a full explanation of what can be queried.
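
For example, to log in and explore the commands (assuming your Nikhef username is jdoe, which is only a placeholder, and picking the interactive node stbc-i1):

ssh jdoe@stbc-i1.nikhef.nl   # Log in to an interactive node (replace jdoe with your username)
condor_submit -help          # Options for submitting jobs
condor_q -help               # Options for querying the queue
condor_status -help          # Options for inspecting the cluster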

Usage

Once you are logged in to one of the stbc-i* nodes, the commands below are useful for submitting, monitoring and removing jobs in HTCondor. For more detailed information about the HTCondor Python bindings or DAGs, see the tutorial slides: https://kb.nikhef.nl/ct/Course_Tutorials.html#batch-computing-tutorials

Commands to monitor or manage your HTCondor jobs (or DAGs)
condor_q                 # Monitor your DAGs/jobs
condor_q -allusers       # Same, for all users

condor_q -analyze 846.0  # Details on job 846
condor_q -better 846.0   # More details on job 846 - same as -better-analyze

condor_q -unmatchable    # List with some details on jobs that do not match any machines/nodes

condor_rm <job_id>       # Remove job or DAG <job_id>
condor_rm <user>         # Remove all <user>'s jobs

Note: if a DAG is removed, it may split up into a list of individual jobs for a number of seconds before disappearing.

Commands to monitor the status of the cluster, modules, nodes or cores
condor_status         # What is each node doing?  (Claimed ~ Busy;  Unclaimed ~ Idle)
condor_status -total  # Just show totals  
condor_status -long   # Detailed info on ALL cpus

### Claimed/free cores:
condor_status -total -af:h  Name Cpus State   | awk '/Unclaimed/ {unc += $2} ; /Claimed/ {cla += $2}; END { print "Total cores " unc+cla ", Claimed " cla " and Unclaimed " unc}'

### Number of active jobs, the cores they're using and their load:
condor_status -const 'Activity == "Busy"' -af CPUsUsage Cpus | awk '{usage += $1; total += $2; jobs += 1}END{print jobs " active jobs are assigned " total " cores and are using " usage " of them (" (usage / total)*100 "%)" }'

### Modules (varying number with varying number of cores):
condor_status -total -af:h  Name Cpus State   | head

Creating a job submission script

If you have never worked with HTCondor, please refer to the User’s Manual with further information about job description files to submit to the cluster: https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html

Running your jobs on the local Nikhef HTCondor cluster requires a few additional attributes in your job submission script. You will need to include

+UseOS = "el9"
## Or bring your own container
+SingularityImage = "/project/myproject/ourimages/myfavouriteimage.sif"

in the submission file.

Next, include a “JobCategory” for the kind of job you wish to run. This sets the maximum time your job is expected to run. You can submit to express, short, medium or long:

* express 10 minutes
* short    4 hours
* medium  24 hours
* long    96 hours

Job categories can be specified with, for example,

+JobCategory = "short"

Note: your jobs must specify JobCategory and either SingularityImage or UseOS to be accepted by the batch system

(see for example the test script below).

Resource requirements and queue selection

You can specify several types of resource requirements for your jobs in the submit script, such as:

  • Memory
  • Storage
  • CPU power (number of CPU cores)
  • Wall clock time
  • CPU time

For example, to request 4 cores and 8G of memory:

request_cpus            = 4
request_memory          = 8G
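
A slightly fuller sketch, combining these requests with the mandatory attributes described above (request_disk is the standard HTCondor knob for scratch disk space; all values here are only illustrative):

+UseOS                  = "el9"
+JobCategory            = "medium"
request_cpus            = 4
request_memory          = 8G
request_disk            = 20G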

In fact, there is a very rich language to express the needs of a job, with many available parameters. The bare minimum that you have to specify is the JobCategory, which translates to MaxWallTime, and the SingularityImage or UseOS, where the latter translates to a pre-defined SingularityImage.

The HTCondor system takes the requirements into account and finds a free slot to schedule the job. If no free slot is available, the system waits until a slot that meets the requirements becomes available. At set times, it will combine free slots to have more resources available in a single slot. So if you ask for a lot of memory or a lot of CPU cores, you may need to wait for this process to create a slot for your job.
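
If your job stays idle for longer than you expect, you can ask the scheduler to explain why it has not been matched yet, using the analysis options shown earlier (the job id is illustrative):

condor_q -better-analyze 846.0   # Explain why job 846.0 has not been matched
condor_q -unmatchable            # List jobs that match no machine at all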

The reason for different job categories

Although you can specify MaxWallTime directly in your job submit file, there are several reasons we want to split jobs up into categories. The gist of it is that it helps HTCondor make efficient decisions about where to place jobs. The re-combination of available slots described above is less wasteful if slots become available at or around the same time, so some way to predict the run length of a job is useful.

An important effect for users is that we can make slots available at shorter notice, so quick, short jobs can be favoured and go through sooner.

The idea is that you should always be able to fire off at least one or two express jobs for testing purposes, i.e. to validate your job submission setup, without having to wait. This is why the express category exists and why it allows jobs of only up to ten minutes.

The short job category is based on 'a batch in the morning, a batch after lunch': for those who like to keep an eye on things but may need a middle-of-the-day tweak here or there.

The medium category is 24 hours, which hits a sort of sweet spot for batch systems. You should strive to make your jobs run for at most 24 hours if you can.

Long jobs may run for up to 96 hours, or 4 days. For operational reasons we don't want jobs to run any longer than that. If your workflow involves jobs running for more than that, checkpointing can help: by checkpointing your jobs once per day, you have a point from which an interrupted job can be resumed.

Submitting under another group

By default, HTCondor uses your login name for accounting when submitting a job. Sometimes you may want to use a group name, for instance if your group has an assigned quota of resources. This must be a group that you are a member of, as can be seen in the output of the id command on the command prompt of a Nikhef system.

To submit under a different group, add a line like

accounting_group        = datagrid

to the submit file, replacing datagrid with your group. Note that this line has no leading + sign and no quotes around the group name.
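
To check which groups your account is a member of, run the id command on any Nikhef system (the -Gn flags simply limit the output to group names):

id -Gn   # Prints the names of all groups you belong to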

What is the status of my jobs?

While your jobs are waiting in the queue or running, their status can be queried with the condor_q command.

dennisvd@stbc-i1$ condor_q 


-- Schedd: taai-007.nikhef.nl : <145.107.7.246:9618?... @ 07/05/24 15:42:57
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
dennisvd ID: 176827   7/5  15:42      _      _      1      1 176827.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for dennisvd: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for all users: 2422 jobs; 0 completed, 0 removed, 30 idle, 1328 running, 1064 held, 0 suspended

In its simplest form the command shows only the status of your own jobs. You can ask for details on any job by specifying its id after condor_q, and get even more detail by adding the -long option.
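
For example, using the job id from the listing above:

condor_q 176827.0          # Status of this one job
condor_q 176827.0 -long    # The full ClassAd with all job attributes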

Pay special attention to 'held' jobs, as these require some attention. HTCondor will not simply throw your job away if something goes wrong, such as exceeding the memory or time limits; instead, it places the job in the held list. This gives you the opportunity to modify the parameters and release the job for another attempt. You may also simply choose to cancel it at that point. See the manual for details on condor_release and condor_rm.
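
A sketch of how this might look (the job id and the raised memory request are illustrative; condor_qedit, condor_release and condor_rm are standard HTCondor commands):

condor_q -hold -af:jh HoldReason             # Why are my jobs held?
condor_qedit 176827.0 RequestMemory 8192     # E.g. raise the memory request (in MB)
condor_release 176827.0                      # Release the job for another attempt
condor_rm 176827.0                           # ... or give up on it instead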

When jobs are finished, their information can still be queried with the condor_history command.
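
For instance (the cluster id is illustrative; -limit and -long are standard condor_history options):

condor_history                   # Your recently completed jobs
condor_history 176827            # Details for one finished cluster
condor_history -limit 10 -long   # Full ClassAds of the last 10 jobs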

Deleting jobs

Users can delete their own jobs with the command condor_rm.

Storing output data

Not all places are appropriate for your output data. Home directories, /project, /data, and dCache all have their strengths and weaknesses. Read about available storage classes on the storage overview page.

Using scratch disk and NFS disk access

When running on the batch system, please be sure to place all local ‘scratch’ files in the directory pointed to by the environment variable $TMPDIR, not in /tmp. The latter is not guaranteed to be very large. The $TMPDIR directory is automatically cleaned up at the end of the job.
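
A minimal sketch of this pattern inside a job script, assuming the job produces a file result.root and that /data/myproject is a hypothetical directory you can write to:

#!/bin/bash
# Work in the node-local scratch area, not /tmp
cd "$TMPDIR"

# ... run your analysis here, writing intermediate and output files to $TMPDIR ...

# Copy the final output to permanent storage before the job ends,
# because $TMPDIR is cleaned up automatically afterwards
cp result.root /data/myproject/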

When accessing NFS (Network File System) mounted disks (/project, /data) please keep in mind that the network bandwidth between Stoomboot nodes and the NFS server is limited and that the NFS server capacity is also limited. Running, for example, 50 jobs that read from or write to files on NFS disks at a high rate (‘ntuple analysis’) may result in poor performance of both the NFS server and your jobs.

General Considerations on submission

The Stoomboot batch system is a shared facility; mistakes you make can cause others to be blocked or lose work, so pay attention to these considerations.

If you'll be doing a lot of I/O, think it through first (or come talk to the CT/PDP group). Some example gotchas:

  • large output files (like logs) going to /tmp might fill up /tmp if many of your jobs land on the same node, and this will hang the node.
  • submitting lots of jobs, each of which opens lots of files, can cause problems on the storage server. Organizing information in thousands of small files is problematic.

Network traffic

To allow communication between worker nodes, the port range 20000-25000 is available. For example, if you want to use Dask, specify one of these ports for the scheduler communication.
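
A minimal sketch, assuming Dask is available in your job's environment and using an arbitrary port from the allowed range (the port number is only an example):

# On the node that runs the Dask scheduler (port must be within 20000-25000)
dask-scheduler --port 20000

# On the other nodes, connect the workers to it (replace <scheduler-node> with its hostname)
dask-worker tcp://<scheduler-node>:20000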

Notes on scheduling

Nikhef sets some policy guardrails to ensure fair sharing across the Stoomboot cluster. These guardrails are a set of policies determining which kinds of jobs can get how many resources, and, within those limits, making sure the resources are spread fairly.

Take, as an example, a group of jobs in the JobCategory medium: medium jobs are allowed to use up to 5080 cores out of the 5800 present. However, those are not all available; in this example there are 1334 cores of work running in "long" and 1409 running in "short".

HTCondor tracks each user’s recent use of the cluster and hands out job slots to waiting jobs in such a way that everyone gets an equal share of resources, as long as this does not break the guardrails. Below is a snapshot of the priorities in the cluster at a given moment, where stbc-019.nikhef.nl is the HTCondor central manager:

condor_userprio -name stbc-019.nikhef.nl                                                                                                           
Last Priority Update: 12/6  11:32
                     Effective   Priority  Wghted Total Usage  Time Since Submitter Submitter
User Name             Priority    Factor   In Use (wghted-hrs) Last Usage   Floor    Ceiling
------------------- ------------ --------- ------ ------------ ---------- --------- ---------
user_b@nikhef.nl      773934.86   1000.00    804    143746.73      <now>      5.00
user_d@nikhef.nl    1254878.44   1000.00    960   6940572.80      <now>
user_a@nikhef.nl    1262253.99   1000.00   1512    529714.32      <now>
user_c@nikhef.nl    1269606.17   1000.00   1729   4982512.03      <now>
------------------- ------------ --------- ------ ------------ ---------- --------- ---------

NOTE: HTCondor calls this a “priority” but it is a funny definition: the LOWER the priority, the more resources HTCondor will give a user. The number in column 2 is the weighted average recent usage times 1000.

User_a’s priority is about the same as user_c’s and almost twice that of user_b, while user_a currently holds three times as many cores as user_c and about twice as many as user_b. Remember the inverse relation: since user_a has the same priority as user_c, but user_c has only 1/3 as many cores as user_a, HTCondor will try to help user_c catch up. For user_b it is a bit more complicated, because user_b is running cores in multiple job classes (JobCategory) and has a better rank (a lower priority value), but they will get more cores assigned in order to catch up.

There is one complication: if user_a asks for a lot of memory, their jobs are assigned more weight, so a single job might count against their slot allocation by a factor related to the memory request.

In summary, HTCondor scheduling is not simple, although it is in most cases quite fair.
The complexity is required to reach two conflicting goals: fast response (I submit ten jobs, I do not want to wait four hours before they run) and high utilization (I do not want the cluster to be half empty even though there are thousands of waiting jobs).

For more information on user priorities and scheduling, see the HTCondor documentation: https://htcondor.readthedocs.io/en/latest/admin-manual/cm-configuration.html#user-priorities-and-negotiation

Debugging and troubleshooting batch jobs

If you want to debug a problem that occurs on a Stoomboot batch job, or you want to make a short trial run for a larger series of batch jobs, there are two ways to gain interactive login access to Stoomboot.

You can either log in directly to the interactive nodes, or you can request an ‘interactive’ batch job with

condor_submit -interactive jobscript.sub

You still need to supply the usual job script, but instead of running the executable, HTCondor will launch an interactive shell in the container image specified by the UseOS or SingularityImage job attribute.

The job ends when you exit the shell.
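
A minimal submit file for such an interactive session could look as follows (a sketch; the file name interactive.sub is just an example, and short is chosen because debugging sessions tend to be brief):

## interactive.sub - only the mandatory attributes are needed
+UseOS                  = "el9"
+JobCategory            = "short"
queue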

Example test script

This is an example test script that will give you more information about where a job is running.

testjob.sub:

executable              = test.sh
log                     = test.log
output                  = outfile.txt
error                   = errors.txt
## Can use "el7", "el8", or "el9" for UseOS, or you can specify your own
## SingularityImage. Either way, an OS or image must be given as a quoted string.
+UseOS                  = "el9"
## This job can run up to 4 hours. Can choose "express", "short", "medium", or "long".
+JobCategory            = "short"
queue

The test.sh script is the accompanying executable; its contents follow below. Make sure you make it executable by running chmod +x test.sh, otherwise the system considers it to be just a text file and it will not run.

Note that this is a Bash script. Programs written in most popular languages, such as Python, C/C++, Perl or Fortran, can also be run on the batch system.

#!/bin/bash

# Default SLEEP_AT_END_OF_JOB to 0 if it is not already set in the environment
: ${SLEEP_AT_END_OF_JOB:=0}

echo "start" > test_file

echo -e ">>>>>>>>>>>>>> User environment\n"
echo -e "\n>>> id"
/usr/bin/id

echo -e "\n>>> pwd"
/bin/pwd
/bin/ls -al

echo -e "\n>>> date"
/bin/date

echo -e "\n>>> env"
/bin/env | sort

echo -e "\n>>> home = $HOME"
ls -al $HOME

echo -e "\n>>> temp directory = $TMPDIR"
ls -al $TMPDIR

echo -e "\n>>> .bashrc"
cat $HOME/.bashrc

echo -e "\n>>> .bash_profile"
cat $HOME/.bash_profile

echo -e "\n>>> .bash_logout"
cat $HOME/.bash_logout

echo -e "\n>>> hostname"
hostname -f

echo -e "\n>>> mounts"
mount

echo -e "\n>>> finished"

if [ $SLEEP_AT_END_OF_JOB -eq 1 ]; then
    echo "Sleeping for a couple of minutes"
    sleep 300
fi
echo "Done"

Contact