CPU batch jobs

Aim: Provide the basics of how to access and use the Stoomboot cluster CPU batch system, i.e. how to submit, monitor and delete batch jobs.

Target audience: Users of the Stoomboot cluster's CPUs.

Introduction

Batch system jobs are used to scale out analysis work — from running a program once or twice to running a program/script tens (or hundreds) of times — to get a solution or results faster.

Prerequisites

  • A Nikhef account
  • An ssh client (usually pre-installed on Linux and Mac OS X, PuTTY on Windows)

Usage

Accessing the batch system

See the Stoomboot cluster page for more information on the Stoomboot cluster and how to access it.

To interact with the batch system (e.g. submit a job, query a job status or delete a job), you need to login on a Linux machine at Nikhef. Machines that can be used include desktops managed by the CT department, the login hosts and the interactive Stoomboot nodes.
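
As a quick sketch (the username jdoe is a placeholder; the host names are the ones used in the examples further down this page), you could hop via the login host to an interactive Stoomboot node and work from there:

ssh jdoe@login.nikhef.nl
ssh stbc-i1.nikhef.nl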

Submitting jobs to the batch system

The command qsub is used to submit jobs to the cluster. A typical use of this command is:

qsub [-q <queue>] [-l resource_name[=[value]][,resource_name[=[value]],...]] [script]

The optional argument script is the user-provided script that does the work (see for example the test script below).

A job is always submitted to a queue (an ordered waiting list of jobs). The section Available queues below provides more information about the available queues and their properties.
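
For example, to submit a (hypothetical) script myscript.sh to the short7 queue:

qsub -q short7 myscript.sh

On success, qsub prints the ID of the new job, which you can later use with qstat and qdel (see below).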

Resource requirements and queue selection

Choose the queue that best matches the expected duration of your job. The scheduler favors shorter jobs over longer ones, but jobs that overrun the queue's time limit will be killed.

The scheduler considers all jobs in the system, not just those within individual queues. It calculates the priority based on how many resources are requested and on the user's past resource consumption.

The resources that can be requested include:

  • the number of processors,
  • the amount of memory, and
  • the wall clock time of the job.

For example:

## Request 4 cores on 1 node:
-l nodes=1:ppn=4

## Request a walltime of 32 hours, 10 minutes and 5 seconds:
-l walltime=32:10:05

Resources that are not specified will use default values based on the queue.
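
The queue choice and the resource list can be combined in a single submission. A hedged example (the script name myjob.sh is a placeholder):

qsub -q generic7 -l nodes=1:ppn=4,walltime=12:00:00 myjob.sh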

Common command-line options for qsub

  • -j oe: merge stdout and stderr into a single file; only a “.o” file is written.
  • -q <queuename>: specify the batch queue; the default queue is “generic”.
  • -o <filename>: specify a different filename for stdout.
  • -V: pass all environment variables from the submitting shell to the batch job (with the exception of $PATH).
  • -l host=v100: run the job on a Stoomboot node with a specific type of GPU.
A full list of options and explanations can be found in the online manual or in the qsub man page.
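
As an illustration of combining these options (the script name analysis.sh is made up), the following submits to the short7 queue, merges stdout and stderr into the file analysis.log, and passes along the submitting shell's environment:

qsub -q short7 -j oe -o analysis.log -V analysis.sh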

Available queues

The queues have different parameters for how many jobs you can submit and run at the same time, and how long each job can run for (measured in wall-clock time). Please choose and specify an appropriate queue when submitting a job:

Queue (CentOS7)   Max. walltime [HH:MM]   Remarks
express7          00:10                   test jobs (max. 2 running jobs per user)
short7            04:00                   short turnaround jobs
generic7          24:00                   default queue
long7             48:00                   max. walltime of 96:00 via resource list
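
For instance, to make use of the extended walltime of the long7 queue, request it explicitly via the resource list (myscript.sh is a placeholder):

qsub -q long7 -l walltime=96:00:00 myscript.sh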

You can also use express, short, generic or long, without the 7 at the end. These are technically routing queues, a holdover from when the Stoomboot cluster still ran CentOS 6. Any job submitted to a routing queue is forwarded, or 'routed', to its equivalent destination queue ending in '7':

Routing queue   Destination queue
express         express7
short           short7
generic         generic7
long            long7
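
For example, the following two submissions end up in the same queue (myscript.sh is again a placeholder):

qsub -q generic myscript.sh
qsub -q generic7 myscript.sh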

Submitting under another group

By default, qsub uses the primary group of the Unix account when submitting a job. Sometimes you may need to use a different group, for example because the primary group has no allocation on the batch server. (If you're unsure what this means, type id from a Nikhef node like login.nikhef.nl or stbc-i1.nikhef.nl to see what your group assignments are.)

To submit with a different group, add -W group_list=<new-group> to qsub. The following example forces the use of the group atlas:

qsub -W group_list=atlas some-script.sh

What is the status of my jobs?

The qstat command shows the status of all jobs in the system. Status code ‘C’ indicates completed, ‘R’ indicates running and ‘Q’ indicates queued.

qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1001.burrell              script.sh        user1           21:22:33 R generic7
1002.burrell              myscript.csh     user2           08:15:38 C long7
1003.burrell              myscript.csh     user2           00:02:13 R long7
1004.burrell              myscript.csh     user2                  0 Q long7

The above example shows 4 jobs from 2 different users. Two jobs are running (1001 and 1003), one has finished (1002) and one is still waiting in the queue.

Only jobs that completed less than 10 minutes ago are listed with status ‘C’. The output of jobs that completed longer ago is kept, but those jobs are no longer shown in the status overview.

More detailed output is shown with qstat -n1:

qstat -n1
                                                                         Req'd  Req'd   Elap
Job ID                  Username Queue       Jobname          SessID NDS   TSK Memory Time  S Time
--------------------    -------- --------    ---------------- ------ ----- --- ------ ----- - -----
1001.burrell.nikhef.n   user1    generic7    script.sh        10280   --   --    --   36:00 R 21:22   stbc-081
1002.burrell.nikhef.n   user2    long7       myscript.csh     28649   --   --    --   96:00 R 08:15   stbc-043
1003.burrell.nikhef.n   user2    long7       myscript.csh     12365   --   --    --   96:00 R 00:02   stbc-028
1004.burrell.nikhef.n   user2    long7       myscript.csh       --    --   --    --   96:00 Q   --     --

The qstat -u <username> command shows the status of the jobs owned by user username:

> qstat -u username
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
1241010.burrell.nikhef  username    generic7 test_qsub.sh      27126   --     --     --   00:10:00 R  00:00:11

More information on qstat can be found in the manual.

Deleting jobs

Users can delete their own jobs with the command qdel:

qdel 1001

This example removes the job with ID 1001. The job ID can be obtained with the qstat command described above.
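
To remove several of your jobs at once, qdel can be given more than one job ID; a hedged example, assuming jobs 1002, 1003 and 1004 are yours:

qdel 1002 1003 1004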

Storing output data

Not all places are appropriate for your output data. Home directories, /project, /data, and dCache all have their strengths and weaknesses. Read about available storage classes on the storage overview page.

Using scratch disk and NFS disk access

When running on the batch system, please place all local ‘scratch’ files in the directory pointed to by the environment variable $TMPDIR, not in /tmp. The latter is very small (a few GiB) and, when it fills up, will cause all kinds of problems for you and other users. The disk pointed to by $TMPDIR is large and fast, and the directory is automatically cleaned up at the end of the job.
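
A minimal sketch of this pattern inside a job script (the paths under /data and the program name are placeholders):

#!/bin/bash
cd $TMPDIR                                # work on the fast local scratch disk
cp /data/myexp/input.root .               # stage the input in once
./my_analysis input.root output.root      # write output locally, not over NFS
cp output.root /data/myexp/results/       # copy the results back before the job ends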

When accessing NFS (Network File System) mounted disks (/project, /data) please keep in mind that the network bandwidth between Stoomboot nodes and the NFS server is limited and that the NFS server capacity is also limited. Running, for example, 50 jobs that read from or write to files on NFS disks at a high rate (‘ntuple analysis’) may result in poor performance of both the NFS server and your jobs.

General considerations on submission and scheduling

The Stoomboot batch system is a shared facility: mistakes you make can cause others to be blocked or to lose work, so pay attention to the considerations below.

If you'll be doing a lot of I/O, think carefully about where your files live (or come talk to the CT/PDP group). Some example gotchas:

  • large output files (like logs) going to /tmp might fill up /tmp if many of your jobs land on the same node, and this will hang the node.
  • submitting lots of jobs, each of which opens lots of files, can cause problems on the storage server. Organizing information in thousands of small files is problematic.
  • there is an inherent dead time of about 30 seconds associated with each job, so very short jobs are horrendously inefficient; they also put a high load on several services, which sometimes causes problems for others. If your average job run time is less than a minute, please consider how to re-pack your work into fewer, longer jobs (see the sketch below).
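
One hedged way to do such re-packing is to loop over many short tasks inside a single job script (the task count, program name and output location are placeholders):

#!/bin/bash
## Run 100 short tasks in one batch job instead of submitting 100 tiny jobs.
for i in $(seq 1 100); do
    ./process_task "$i" > "$TMPDIR/task_${i}.log"
done
cp "$TMPDIR"/task_*.log /data/myexp/logs/   # copy the collected logs back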

How scheduling is done

The information below may help you to understand why your jobs are not running, or why there are fewer jobs running than you expected.

The batch system works on a fair-share scheduling basis. Each group at Nikhef gets an equal share allocated, and within each group, all users are equal. The scheduler bases its decision on two pieces of information:

  • how much time your group (e.g., atlas, virgo or km3net) has used over the last 8 hours, and
  • how much time you (e.g., templon, verkerke or stanb) have used over the last 8 hours.

The group usage is converted into a group ranking component. If your group has used less than its standard share in the last 8 hours, this component is positive, growing larger the less CPU time the group has used; if the group has used more than its standard share, the component is negative, growing more negative the more it has used. The algorithm is the same for all groups at Nikhef. The user usage is converted in a similar way, although the scale of the group component is larger than that of the user component. The two components are added, resulting in a ranking, and the jobs with the highest ranking run first. So jobs essentially run in this order:

  1. low group usage in the past 8 hours, also low user usage,
  2. low group usage in the past 8 hours, higher user usage,
  3. higher group usage in the past 8 hours, lower user usage,
  4. higher group usage in the past 8 hours, higher user usage.

The system is fair in that it gives priority to jobs from users and groups that have not used many cycles in the last 8 hours. However, the response is not always immediate: the scheduler cannot run a new job until another one ends, and if several groups have jobs running, your group may rank higher than some of them but lower than others.

There is one more consideration in scheduling: the maximum number of running jobs is limited per queue and per user. For example, the long queue is limited to 312 running single-core jobs, not even half of the total capacity; this prevents long jobs from blocking all other use of the cluster for a long period of time. Similarly, although the generic queue can run a maximum of 390 jobs, a single user cannot run more than 342 of them; this prevents any one user from grabbing all the slots in the generic queue. These limits are a compromise: if the cluster is empty you want them to be big, and if it is full you want them to be smaller. There is no good way to automate this, so they are set at fixed intermediate values.

Debugging and troubleshooting batch jobs

If you want to debug a problem that occurs on a Stoomboot batch job, or you want to make a short trial run for a larger series of batch jobs, there are two ways to gain interactive login access to Stoomboot.

You can either log in directly to the interactive nodes, or request an ‘interactive’ batch job with

qsub -I

In this mode you can consume as many CPU resources as allowed by the queue that the interactive job was submitted to (by default this is the generic queue). The ‘look and feel’ of interactive batch jobs is nearly identical to that of using ssh to connect to an interactive node. The main exception is that when no free job slot is available the qsub command will pause until one becomes available.

Note: if trying to run an interactive multicore job, the number of cores is limited to 32 — this is the maximum number of cores available on one node in the Stoomboot cluster.

## Request 8 cores on one node:
qsub -I -l nodes=1:ppn=8

Example test script

This is an example test script that will give you more information about where a job is running. It has an optional sleep condition at the end.

Note that this is a Bash script. Scripts written in most other popular languages can also be run on the batch system, such as Python (check whether your program needs Python 2 or Python 3), C/C++ (beware of possible compiler version mismatches, depending on what your code requires), Perl, Fortran, etc.

#!/bin/bash
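#
# Test script for the Stoomboot batch system: it prints information about the
# user environment on the worker node (identity, working directory, environment
# variables, home and scratch directories, mounts and quota). Set
# SLEEP_AT_END_OF_JOB=1 in the job's environment to keep the job alive for five
# minutes at the end, e.g. for inspecting it with qstat.
#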

: ${SLEEP_AT_END_OF_JOB:=0}

echo "start" > test_file

echo -e ">>>>>>>>>>>>>> User environment\n"
echo -e "\n>>> id"
/usr/bin/id

echo -e "\n>>> pwd"
/bin/pwd
/bin/ls -al

echo -e "\n>>> date"
/bin/date

echo -e "\n>>> env"
/bin/env | sort

echo -e "\n>>> home = $HOME"
ls -al $HOME

echo -e "\n>>> sessiondir = $TMPDIR"
ls -al $TMPDIR

echo -e "\n>>> .bashrc"
cat $HOME/.bashrc

echo -e "\n>>> .bash_profile"
cat $HOME/.bash_profile

echo -e "\n>>> .bash_logout"
cat $HOME/.bash_logout

echo -e "\n>>> hostname"
hostname -f

echo -e "\n>>> mounts"
mount

echo -e "\n>>> quota"
quota

echo -e "\n>>> finished"

if [ "$SLEEP_AT_END_OF_JOB" -eq 1 ]; then
    echo "Sleeping for five minutes"
    sleep 300
fi
echo "Done"

Contact