
CPU batch jobs

Aim: Provide the basics of how to access and use the Stoomboot cluster CPU batch system, i.e. how to submit, monitor and delete batch jobs.

Prerequisites

  • A Nikhef account
  • An ssh client (usually pre-installed on Linux and Mac OS X, PuTTY or Visual Studio Code on Windows)

Introduction

Batch system jobs are used to scale out analysis work—from running a program once or twice to running a program/script tens (or hundreds) of times—to get a solution or results faster.

The Stoomboot cluster uses the HTCondor software scheduler to distribute jobs across more than 5k cores. The dCache, home directories, /project and /data storage systems are all accessible from any of the worker nodes (or execute nodes/points) and the interactive Stoomboot servers.

Some points to note about the batch system:

  1. All jobs run in a container. If the job submission script does not specify a container, the job(s) will run in an Alma Linux 9 container.
  2. Jobs are limited to a maximum runtime of 4 days; the exact limit depends on the “JobCategory” chosen for the job.
  3. The maximum number of CPU cores per job that we guarantee to be available is 16. This can be raised if there is a need.

Accessing the batch system

To interact with the batch system (e.g. submit a job, query a job status or delete a job), you need to log in to one of the interactive Stoomboot nodes. These nodes allow you to run condor_submit to submit batch jobs, condor_q -all to see what jobs are currently running on the system, and condor_status to show information about what is available in the cluster. Note that these commands offer much more than shown here; add -help to any of them for a full explanation of what can be queried.
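
For example, to log in and explore the commands (assuming your Nikhef username is jdoe, which is only a placeholder, and picking the interactive node stbc-i1):

ssh jdoe@stbc-i1.nikhef.nl   # Log in to an interactive node (replace jdoe with your username)
condor_submit -help          # Options for submitting jobs
condor_q -help               # Options for querying the queue
condor_status -help          # Options for inspecting the cluster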

Usage

Once you are logged in to one of the stbc-i* nodes, the commands below are useful for submitting, monitoring and removing jobs in HTCondor. For more detailed information about the HTCondor Python bindings or DAGs, see the tutorial slides: https://kb.nikhef.nl/ct/Course_Tutorials.html#batch-computing-tutorials

Commands to monitor or manage your HTCondor jobs (or DAGs)
condor_q                 # Monitor your DAGs/jobs
condor_q -allusers       # Same, for all users

condor_q -analyze 846.0  # Details on job 846
condor_q -better 846.0   # More details on job 846 - same as -better-analyze

condor_q -unmatchable    # List with some details on jobs that do not match any machines/nodes

condor_rm <job_id>       # Remove job or DAG <job_id>
condor_rm <user>         # Remove all <user>'s jobs

Note: if a DAG is removed, it may split up into a list of individual jobs for a number of seconds before disappearing.

Commands to monitor the status of the cluster, modules, nodes or cores
condor_status         # What is each node doing?  (Claimed ~ Busy;  Unclaimed ~ Idle)
condor_status -total  # Just show totals  
condor_status -long   # Detailed info on ALL cpus

### Claimed/free cores:
condor_status -total -af:h  Name Cpus State   | awk '/Unclaimed/ {unc += $2} ; /Claimed/ {cla += $2}; END { print "Total cores " unc+cla ", Claimed " cla " and Unclaimed " unc}'

### Number of active jobs, the cores they're using and their load:
condor_status -const 'Activity == "Busy"' -af CPUsUsage Cpus | awk '{usage += $1; total += $2; jobs += 1}END{print jobs " active jobs are assigned " total " cores and are using " usage " of them (" (usage / total)*100 "%)" }'

### Modules (varying number with varying number of cores):
condor_status -total -af:h  Name Cpus State   | head

Creating a job submission script

If you have never worked with HTCondor, please refer to the User’s Manual with further information about job description files to submit to the cluster: https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html

Running your jobs on the local Nikhef HTCondor cluster requires a few additional attributes in your job submission script. You will need to include

+UseOS = "el9"
## Or bring your own container
+SingularityImage = "/project/myproject/ourimages/myfavouriteimage.sif"

in the submission file.

Next, include a “JobCategory” for the kind of job you wish to run. This sets the maximum time your job is expected to run. You can submit to express, short, medium or long:

* express 10 minutes
* short    4 hours
* medium  24 hours
* long    96 hours

Job categories can be specified with, for example,

+JobCategory = "short"

Note: your jobs must specify JobCategory and either SingularityImage or UseOS to be accepted by the batch system

(see for example the test script below).

Resource requirements and queue selection

You can specify several types of resource requirements for your jobs in the submit script, such as:

  • Memory
  • Storage
  • CPU power (number of CPU cores)
  • Wall clock time
  • CPU time

For example, to request 4 cores and 8G of memory:

request_cpus            = 4
request_memory          = 8G
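
A slightly fuller sketch, combining these requests with the mandatory attributes described above (request_disk is the standard HTCondor knob for scratch disk space; all values here are only illustrative):

+UseOS                  = "el9"
+JobCategory            = "medium"
request_cpus            = 4
request_memory          = 8G
request_disk            = 20G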

In fact, there is a very rich language to express the needs of a job, with many available parameters. The bare minimum that you have to specify is the JobCategory, which translates to MaxWallTime, and the SingularityImage or UseOS, where the latter translates to a pre-defined SingularityImage.

The HTCondor system takes the requirements into account and finds a free slot to schedule the job. If no free slot is available, the system waits until a slot that meets the requirements becomes available. At set times, it will combine free slots to have more resources available in a single slot. So if you ask for a lot of memory or a lot of CPU cores, you may need to wait for this process to create a slot for your job.
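
If your job stays idle for longer than you expect, you can ask the scheduler to explain why it has not been matched yet, using the analysis options shown earlier (the job id is illustrative):

condor_q -better-analyze 846.0   # Explain why job 846.0 has not been matched
condor_q -unmatchable            # List jobs that match no machine at all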

The reason for different job categories

Although you can specify MaxWallTime directly in your job submit file, there are several reasons we want to split jobs up into categories. The gist of it is that it helps HTCondor make efficient decisions about where to place jobs. The re-combination of available slots described above is less wasteful if slots become available at or around the same time, so some way to predict the run length of a job is useful.

An important effect for users is that we can make slots available at shorter notice, so quick, short jobs can be favoured and go through sooner.

The idea is that you should always be able to fire off at least one or two express jobs for testing purposes, i.e. to validate your job submission setup, without having to wait. This is why the express category exists and why it allows jobs of only up to ten minutes.

The short job category is based on 'a batch in the morning, a batch after lunch': for those who like to keep an eye on things but may need a middle-of-the-day tweak here or there.

The medium category is 24 hours, which hits a sort of sweet spot for batch systems. You should strive to make your jobs run for at most 24 hours if you can.

Long jobs may run for up to 96 hours, or 4 days. For operational reasons we don't want jobs to run any longer than that. If your workflow involves jobs running for more than that, checkpointing can help: by checkpointing your jobs once per day, you have a point from which an interrupted job can be resumed.

Submitting under another group

By default, HTCondor uses your login name for accounting when submitting a job. Sometimes you may want to use a group name, for instance if your group has an assigned quota of resources. This must be a group that you are a member of, as can be seen in the output of the id command on the command prompt of a Nikhef system.

To submit under a different group, add a line like

accounting_group        = datagrid

to the submit file, replacing datagrid with your group. Note that this line has no leading + sign and no quotes around the group name.
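
To check which groups your account is a member of, run the id command on any Nikhef system (the -Gn flags simply limit the output to group names):

id -Gn   # Prints the names of all groups you belong to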

What is the status of my jobs?

While your jobs are waiting in the queue or running, their status can be queried with the condor_q command.

dennisvd@stbc-i1$ condor_q 


-- Schedd: taai-007.nikhef.nl : <145.107.7.246:9618?... @ 07/05/24 15:42:57
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
dennisvd ID: 176827   7/5  15:42      _      _      1      1 176827.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for dennisvd: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for all users: 2422 jobs; 0 completed, 0 removed, 30 idle, 1328 running, 1064 held, 0 suspended

In its simplest form the command shows only the status of your own jobs. You can ask for details on any job by specifying its id after condor_q, and get even more detail by adding the -long option.
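
For example, using the job id from the listing above:

condor_q 176827.0          # Status of this one job
condor_q 176827.0 -long    # The full ClassAd with all job attributes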

Pay special attention to 'held' jobs, as these require some attention. HTCondor will not simply throw your job away if something goes wrong, such as exceeding the memory or time limits; instead, it places the job in the held list. This gives you the opportunity to modify the parameters and release the job for another attempt. You may also simply choose to cancel it at that point. See the manual for details on condor_release and condor_rm.
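
A sketch of how this might look (the job id and the raised memory request are illustrative; condor_qedit, condor_release and condor_rm are standard HTCondor commands):

condor_q -hold -af:jh HoldReason             # Why are my jobs held?
condor_qedit 176827.0 RequestMemory 8192     # E.g. raise the memory request (in MB)
condor_release 176827.0                      # Release the job for another attempt
condor_rm 176827.0                           # ... or give up on it instead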

When jobs are finished, their information can still be queried with the condor_history command.
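
For instance (the cluster id is illustrative; -limit and -long are standard condor_history options):

condor_history                   # Your recently completed jobs
condor_history 176827            # Details for one finished cluster
condor_history -limit 10 -long   # Full ClassAds of the last 10 jobs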

Deleting jobs

Users can delete their own jobs with the command condor_rm.

Storing output data

Not all places are appropriate for your output data. Home directories, /project, /data, and dCache all have their strengths and weaknesses. Read about available storage classes on the storage overview page.

Using scratch disk and NFS disk access

When running on the batch system, please be sure to place all local ‘scratch’ files in the directory pointed to by the environment variable $TMPDIR, not in /tmp. The latter is not guaranteed to be very large. The $TMPDIR directory is automatically cleaned up at the end of the job.
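
A minimal sketch of this pattern inside a job script, assuming the job produces a file result.root and that /data/myproject is a hypothetical directory you can write to:

#!/bin/bash
# Work in the node-local scratch area, not /tmp
cd "$TMPDIR"

# ... run your analysis here, writing intermediate and output files to $TMPDIR ...

# Copy the final output to permanent storage before the job ends,
# because $TMPDIR is cleaned up automatically afterwards
cp result.root /data/myproject/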

When accessing NFS (Network File System) mounted disks (/project, /data) please keep in mind that the network bandwidth between Stoomboot nodes and the NFS server is limited and that the NFS server capacity is also limited. Running, for example, 50 jobs that read from or write to files on NFS disks at a high rate (‘ntuple analysis’) may result in poor performance of both the NFS server and your jobs.

General Considerations on submission

The Stoomboot batch system is a shared facility; mistakes you make can cause others to be blocked or lose work, so pay attention to these considerations.

If you'll be doing a lot of I/O, think it through first (or come talk to the CT/PDP group). Some example gotchas:

  • large output files (like logs) going to /tmp might fill up /tmp if many of your jobs land on the same node, and this will hang the node.
  • submitting lots of jobs, each of which opens lots of files, can cause problems on the storage server. Organizing information in thousands of small files is problematic.

Network traffic

To allow communication between worker nodes, the port range 20000-25000 is available. For example, if you want to use Dask, specify one of these ports for the scheduler communication.
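
A minimal sketch, assuming Dask is available in your job's environment and using an arbitrary port from the allowed range (the port number is only an example):

# On the node that runs the Dask scheduler (port must be within 20000-25000)
dask-scheduler --port 20000

# On the other nodes, connect the workers to it (replace <scheduler-node> with its hostname)
dask-worker tcp://<scheduler-node>:20000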

Notes on scheduling

Nikhef sets some policy guardrails to ensure fair sharing across the Stoomboot cluster. These guardrails are a set of policies determining which kinds of jobs can get how many resources, and, within those limits, making sure the resources are spread fairly.

Take, as an example, a group of jobs in the JobCategory medium: medium jobs are allowed to use up to 5080 cores out of the 5800 present. However, those are not all available; in this example there are 1334 cores of work running in "long" and 1409 running in "short".

HTCondor tracks each user’s recent use of the cluster and hands out job slots to waiting jobs in such a way that everyone gets an equal share of resources, as long as this does not break the guardrails. Below is a snapshot of the priorities in the cluster at a given moment, where stbc-019.nikhef.nl is the HTCondor central manager:

condor_userprio -name stbc-019.nikhef.nl                                                                                                           
Last Priority Update: 12/6  11:32
                     Effective   Priority  Wghted Total Usage  Time Since Submitter Submitter
User Name             Priority    Factor   In Use (wghted-hrs) Last Usage   Floor    Ceiling
------------------- ------------ --------- ------ ------------ ---------- --------- ---------
user_b@nikhef.nl      773934.86   1000.00    804    143746.73      <now>      5.00
user_d@nikhef.nl    1254878.44   1000.00    960   6940572.80      <now>
user_a@nikhef.nl    1262253.99   1000.00   1512    529714.32      <now>
user_c@nikhef.nl    1269606.17   1000.00   1729   4982512.03      <now>
------------------- ------------ --------- ------ ------------ ---------- --------- ---------

NOTE: HTCondor calls this a “priority” but it is a funny definition: the LOWER the priority, the more resources HTCondor will give a user. The number in column 2 is the weighted average recent usage times 1000.

User_a’s priority is about the same as user_c’s and almost twice that of user_b, while user_a currently holds three times as many cores as user_c and about twice as many as user_b. Remember the inverse relation: since user_a has the same priority as user_c, but user_c has only 1/3 as many cores as user_a, HTCondor will try to help user_c catch up. For user_b it is a bit more complicated, because user_b is running cores in multiple job classes (JobCategory) and has a better rank (a lower priority value), but they will get more cores assigned in order to catch up.

There is one complication: if user_a asks for a lot of memory, their jobs are assigned more weight, so a single job might count against their slot allocation by a factor related to the memory request.

In summary, HTCondor scheduling is not simple, although it is in most cases quite fair.
The complexity is required to reach two conflicting goals: fast response (I submit ten jobs, I do not want to wait four hours before they run) and high utilization (I do not want the cluster to be half empty even though there are thousands of waiting jobs).

For more information on user priorities and scheduling, see the HTCondor documentation: https://htcondor.readthedocs.io/en/latest/admin-manual/cm-configuration.html#user-priorities-and-negotiation

Debugging and troubleshooting batch jobs

If you want to debug a problem that occurs on a Stoomboot batch job, or you want to make a short trial run for a larger series of batch jobs, there are two ways to gain interactive login access to Stoomboot.

You can either log in directly to the interactive nodes, or you can request an ‘interactive’ batch job with

condor_submit -interactive jobscript.sub

You still need to supply the usual job script, but instead of running the executable, HTCondor will launch an interactive shell in the container image specified by the UseOS or SingularityImage job attribute.

The job ends when you exit the shell.
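
A minimal submit file for such an interactive session could look as follows (a sketch; the file name interactive.sub is just an example, and short is chosen because debugging sessions tend to be brief):

## interactive.sub - only the mandatory attributes are needed
+UseOS                  = "el9"
+JobCategory            = "short"
queue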

Example test script

This is an example test script that will give you more information about where a job is running.

testjob.sub:

executable              = test.sh
log                     = test.log
output                  = outfile.txt
error                   = errors.txt
## Can use "el7", "el8", or "el9" for UseOS, or you can specify your own
## SingularityImage. Either way, an OS or image must be given as a quoted string.
+UseOS                  = "el9"
## This job can run up to 4 hours. Can choose "express", "short", "medium", or "long".
+JobCategory            = "short"
queue

The test.sh script is the accompanying executable; its contents follow below. Make sure you make it executable by running chmod +x test.sh, otherwise the system considers it to be just a text file and it will not run.

Note that this is a Bash script. Programs written in most popular languages, such as Python, C/C++, Perl or Fortran, can also be run on the batch system.

#!/bin/bash

# Default SLEEP_AT_END_OF_JOB to 0 if it is not already set in the environment
: ${SLEEP_AT_END_OF_JOB:=0}

echo "start" > test_file

echo -e ">>>>>>>>>>>>>> User environment\n"
echo -e "\n>>> id"
/usr/bin/id

echo -e "\n>>> pwd"
/bin/pwd
/bin/ls -al

echo -e "\n>>> date"
/bin/date

echo -e "\n>>> env"
/bin/env | sort

echo -e "\n>>> home = $HOME"
ls -al $HOME

echo -e "\n>>> temp directory = $TMPDIR"
ls -al $TMPDIR

echo -e "\n>>> .bashrc"
cat $HOME/.bashrc

echo -e "\n>>> .bash_profile"
cat $HOME/.bash_profile

echo -e "\n>>> .bash_logout"
cat $HOME/.bash_logout

echo -e "\n>>> hostname"
hostname -f

echo -e "\n>>> mounts"
mount

echo -e "\n>>> finished"

if [ $SLEEP_AT_END_OF_JOB -eq 1 ]; then
    echo "Sleeping for a couple of minutes"
    sleep 300
fi
echo "Done"

Contact