CPU batch jobs
Aim: Provide the basics of how to access and use the Stoomboot cluster CPU batch system, i.e. how to submit, monitor and delete batch jobs.
Target audience: Users of the Stoomboot cluster's CPUs.
Introduction
Batch system jobs are used to scale out analysis work — from running a program once or twice to running a program/script tens (or hundreds) of times — to get a solution or results faster.
Prerequisites
- A Nikhef account
- An ssh client (usually pre-installed on Linux and Mac OS X, PuTTY on Windows)
Usage
Accessing the batch system
See the Stoomboot cluster page for more information on the Stoomboot cluster and how to access it.
To interact with the batch system (e.g. submit a job, query a job's status or delete a job), you need to log in to a Linux machine at Nikhef. Machines that can be used include desktops managed by the CT department, the login hosts and the interactive Stoomboot nodes.
Submitting jobs to the batch system
The command `qsub` is used to submit jobs to the cluster. A typical use of this command is:
qsub [-q <queue>] [-l resource_name[=[value]][,resource_name[=[value]],...]] [script]
The optional argument `script` is the user-provided script that does the work (see for example the test script below).
A job is always submitted to a queue (ordered (waiting) list of jobs). The section Available queues below provides more information about the available queues and their properties.
Resource requirements and queue selection
Choose the queue that best matches the expected duration of your job. The scheduler favors shorter jobs over longer ones, but jobs that overrun the queue's time limit are killed.
The scheduler will consider all jobs in the system, not just individual queues. It will calculate the priority based on how many resources are needed and on past resource consumption by the user.
The resources that can be requested include:
- the number of processors,
- the amount of memory, and
- the wall clock time of the job.
For example:
## Request 4 cores on 1 node:
-l nodes=1:ppn=4
## Request a walltime of 32 hours, 10 minutes and 5 seconds:
-l walltime=32:10:05
Common command-line options for qsub
- `-j oe`: merge stdout and stderr into a single file; a single “.o” file is written.
- `-q <queuename>`: specify the batch queue; the default queue is “generic”.
- `-o <filename>`: specify a different filename for stdout.
- `-V`: pass all environment variables from the submitting shell to the batch job (with the exception of $PATH).
- `-l host=v100`: run the job on a Stoomboot node with a specific type of GPU.
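Instead of passing these options on the command line every time, Torque also lets you embed them in the job script itself as `#PBS` directives. A minimal sketch (the queue and resource values here are example choices, adjust them to your own needs):

```shell
#!/bin/bash
#PBS -q generic7                 # queue to submit to
#PBS -l walltime=04:00:00        # requested wall-clock time
#PBS -l nodes=1:ppn=1            # one core on one node
#PBS -j oe                       # merge stdout and stderr

# The directives above are read by qsub at submission time; as shell
# comments they are harmless when the script is run directly.
node=$(hostname)
echo "Job running on $node"
```

Submit it with `qsub myjob.sh`; options given on the command line override the `#PBS` directives in the script.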
For a full list of options, see the `qsub` man page.
Available queues
The queues have different parameters for how many jobs you can submit and run at the same time, and how long each job can run for (measured in wall-clock time). Please choose and specify an appropriate queue when submitting a job:
CentOS7 queues | Max. walltime [HH:MM] | Remarks |
---|---|---|
express7 | 00:10 | test jobs (max. 2 running jobs per user) |
short7 | 04:00 | short turnaround jobs |
generic7 | 24:00 | default queue |
long7 | 48:00 | max. walltime of 96:00 via resource list |
You can also enter `express`, `short`, `generic` or `long`, without the `7` at the end. These are technically routing queues, a holdover from when the Stoomboot cluster still ran CentOS 6 as an OS. Any job submitted to a routing queue is forwarded, or 'routed', to its equivalent destination queue with a '7'.
routing queue | destination queue |
---|---|
express | express7 |
short | short7 |
generic | generic7 |
long | long7 |
Submitting under another group
By default, `qsub` uses the primary group of the Unix account when submitting a job. Sometimes you may need to use a different group, for example because the primary group has no allocation on the batch server. (If you're unsure what this means, type `id` on a Nikhef node like `login.nikhef.nl` or `stbc-i1.nikhef.nl` to see what your group assignments are.)
To submit with a different group, add `-W group_list=<new-group>` to `qsub`. The following example forces the use of the group `atlas`:
qsub -W group_list=atlas some-script.sh
What is the status of my jobs?
The `qstat` command shows the status of all jobs in the system. Status code ‘C’ indicates completed, ‘R’ indicates running and ‘Q’ indicates queued.
qstat
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1001.burrell script.sh user1 21:22:33 R generic7
1002.burrell myscript.csh user2 08:15:38 C long7
1003.burrell myscript.csh user2 00:02:13 R long7
1004.burrell myscript.csh user2 0 Q long7
The above example shows 4 jobs from 2 different users. Two jobs are running (1001 and 1003), one has finished (1002) and one is still waiting in the queue.
Only jobs that completed less than 10 minutes ago are listed with status ‘C’. Output of jobs that finished earlier is kept, but those jobs are no longer listed in the status overview.
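If you want to count your jobs per status code, you can filter on the status column of the listing. A small sketch, using a here-document in place of live `qstat` output so it can be tried anywhere (the column positions match the listing above):

```shell
#!/bin/bash
# Count running jobs in qstat-style output; the here-document stands in
# for the real output of `qstat`.
listing=$(cat <<'EOF'
1001.burrell script.sh    user1 21:22:33 R generic7
1002.burrell myscript.csh user2 08:15:38 C long7
1003.burrell myscript.csh user2 00:02:13 R long7
1004.burrell myscript.csh user2 0        Q long7
EOF
)
# The status code is the fifth whitespace-separated column.
running=$(printf '%s\n' "$listing" | awk '$5 == "R"' | wc -l | tr -d ' ')
echo "running jobs: $running"
```

On the cluster itself you would pipe the output of `qstat -u $USER` through the same filter.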
More detailed output is shown with `qstat -n1`:
qstat -n1
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
1001.burrell.nikhef.n user1 generic7 script.sh 10280 -- -- -- 36:00 R 21:22 stbc-081
1002.burrell.nikhef.n user2 long7 myscript.csh 28649 -- -- -- 96:00 R 08:15 stbc-043
1003.burrell.nikhef.n user2 long7 myscript.csh 12365 -- -- -- 96:00 R 00:02 stbc-028
1004.burrell.nikhef.n user2 long7 myscript.csh -- -- -- -- 96:00 Q -- --
The `qstat -u <username>` command shows the status of the jobs owned by user `username`:
> qstat -u username
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
1241010.burrell.nikhef username generic7 test_qsub.sh 27126 -- -- -- 00:10:00 R 00:00:11
More information on `qstat` can be found in the manual.
Deleting jobs
Users can delete their own jobs with the command `qdel`:
qdel 1001
This example removes the job with ID 1001. The job ID can be obtained with the `qstat` command described above.
Storing output data
Not all places are appropriate for your output data. Home directories, `/project`, `/data`, and dCache all have their strengths and weaknesses. Read about the available storage classes on the storage overview page.
Using scratch disk and NFS disk access
When running on the batch system, please be sure to locate all local ‘scratch’ files in the directory pointed to by the environment variable `$TMPDIR`, and not in `/tmp`. The latter is very small (a few GiB) and, when filled up, causes all kinds of problems for you and other users. The disk pointed to by `$TMPDIR` is large and fast, and the directory is automatically cleaned up at the end of the job.
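A common pattern is to do all heavy I/O in `$TMPDIR` and copy only the final result to permanent storage at the very end of the job. A sketch of that pattern (the destination `DEST` is a hypothetical placeholder; point it at your own `/data` or `/project` area):

```shell
#!/bin/bash
# Work in the node-local scratch area; fall back to /tmp when testing
# outside the batch system, where $TMPDIR may be unset.
WORKDIR="${TMPDIR:-/tmp}/myjob.$$"
mkdir -p "$WORKDIR"
cd "$WORKDIR"

# Stand-in for the real analysis producing an output file:
echo "result data" > output.dat

# Copy the result to permanent storage once, at the end of the job.
# DEST is hypothetical; substitute your own /data or /project directory.
DEST="${DEST:-$HOME}"
cp output.dat "$DEST/output.dat"

cd /
rm -rf "$WORKDIR"   # $TMPDIR is cleaned automatically, but tidy up anyway
```

This keeps the many small reads and writes on the fast local disk and touches the shared storage only once per job.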
When accessing NFS (Network File System) mounted disks (`/project`, `/data`), please keep in mind that both the network bandwidth between the Stoomboot nodes and the NFS server and the NFS server's capacity are limited. Running, for example, 50 jobs that read from or write to files on NFS disks at a high rate (‘ntuple analysis’) may result in poor performance of both the NFS server and your jobs.
General Considerations on submission and scheduling
The Stoomboot batch system is a shared facility: mistakes you make can cause others to be blocked or to lose work, so pay attention to these considerations.
If you'll be doing a lot of I/O, think about it first (or come talk to the CT/PDP group). Some example gotchas:
- Large output files (like logs) going to `/tmp` might fill up `/tmp` if many of your jobs land on the same node, and this will hang the node.
- Submitting lots of jobs, each of which opens lots of files, can cause problems on the storage server. Organizing information in thousands of small files is problematic.
- There is an inherent dead time of about 30 seconds associated with each job. Very short jobs are therefore horrendously inefficient; they also cause high loads on several services, which sometimes causes problems for others. If your average job run time is less than a minute, please consider how to re-pack your work into fewer, longer jobs.
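One way to avoid the per-job overhead is to pack many short tasks into a single job that loops over them. A sketch, where `run_task` is a hypothetical stand-in for your real per-item program:

```shell
#!/bin/bash
# Pack 100 short tasks into one batch job instead of submitting 100 jobs,
# amortizing the ~30 s per-job dead time over all of them.
run_task() {            # hypothetical placeholder for the real work
  echo "processed item $1"
}

count=0
for item in $(seq 1 100); do
  run_task "$item" > /dev/null
  count=$((count + 1))
done
echo "handled $count items in one job"
```

Splitting your workload into a handful of such jobs, each running for tens of minutes, keeps both the scheduler and the shared services happy.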
How scheduling is done
The information below may help you to understand why your jobs are not running, or why there are fewer jobs running than you expected.
The batch system works on a fair-share scheduling basis. Each group at Nikhef gets an equal share allocated, and within each group, all users are equal. The scheduler makes its decision based on a couple of pieces of information:
- how much time your group (e.g., `atlas`, `virgo` or `km3net`) has used over the last 8 hours, and
- how much time you (e.g., `templon`, `verkerke` or `stanb`) have used over the last 8 hours.
The group usage is converted to a group ranking component. If your group has used less than the standard share in the last 8 hours, this component is positive, growing larger the less CPU time the group has used. If your group has used more than the standard share, the component is negative, growing more negative the more it has used. The algorithm is the same for all groups at Nikhef. There is a similar conversion for the user usage, with the group component carrying more weight than the user one. The two components are added, resulting in a ranking; the jobs with the highest ranking run first. So jobs are essentially run in this order:
- low group usage in the past 8 hours, also low user usage,
- low group usage in the past 8 hours, higher user usage,
- higher group usage in the past 8 hours, lower user usage,
- higher group usage in the past 8 hours, higher user usage.
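As a toy illustration of this ordering (the weights below are invented for illustration, not the real scheduler's numbers): compute a rank from the usage deficits, with the group deficit weighted more heavily than the user deficit.

```shell
#!/bin/bash
# Toy fair-share ranking: usage figures are percentages of the standard
# share over the last 8 hours; the weights (10 and 1) are made up.
rank() {
  local group_usage=$1 user_usage=$2
  echo $(( 10 * (100 - group_usage) + (100 - user_usage) ))
}

a=$(rank 20 20)   # low group usage,  low user usage
b=$(rank 20 80)   # low group usage,  high user usage
c=$(rank 80 20)   # high group usage, low user usage
d=$(rank 80 80)   # high group usage, high user usage
echo "ranks: $a $b $c $d"
```

With these numbers the ranks come out in the same order as the list above: a > b > c > d.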
The system is fair in that it gives priority to jobs from users and groups that haven't used many cycles in the last 8 hours. However, the response is not always immediate: the scheduler cannot start a new job until another one ends, and if more than two groups are running, your particular group may have a higher ranking than one but a lower ranking than the other.
There is one more consideration in scheduling: the maximum number of running jobs is limited per queue and per user. For example, the long queue is limited to running 312 single-core jobs, not even half of the total capacity; this is done to prevent long jobs from blocking all other use of the cluster for a long period of time. Also, despite the generic queue being able to run a maximum of 390 jobs, a single user cannot run more than 342 jobs. This is to prevent any user from grabbing all the slots in the generic queue for themselves. The numbers are a compromise; if the cluster is empty, you want these numbers to be big; if it is full, you want them to be smaller. There is no good way to automate this, so we set them at a compromise.
Debugging and troubleshooting batch jobs
If you want to debug a problem that occurs on a Stoomboot batch job, or you want to make a short trial run for a larger series of batch jobs, there are two ways to gain interactive login access to Stoomboot.
You can either log in directly to the interactive nodes, or you can request an ‘interactive’ batch job through
qsub -I
In this mode you can consume as many CPU resources as allowed by the queue that the interactive job was submitted to (by default the generic queue). The ‘look and feel’ of interactive batch jobs is nearly identical to that of using ssh to connect to an interactive node. The main exception is that when no free job slot is available, the `qsub` command will pause until one becomes available.
Note: if trying to run an interactive multicore job, the number of cores is limited to 32 — this is the maximum number of cores available on one node in the Stoomboot cluster.
## Request 8 cores on one node:
qsub -I -l nodes=1:ppn=8
Example test script
This is an example test script that will give you more information about where a job is running. It has an optional sleep condition at the end.
Note this is a Bash script. Scripts written in most popular languages can also be run on the batch system, such as Python (check whether the node provides the Python 2 or Python 3 version your program needs), C/C++ (beware of compiler version mismatches, depending on what your code requires), Perl, Fortran, etc.
#!/bin/bash
: ${SLEEP_AT_END_OF_JOB:=0}
echo "start" > test_file
echo -e ">>>>>>>>>>>>>> User environment\n"
echo -e "\n>>> id"
/usr/bin/id
echo -e "\n>>> pwd"
/bin/pwd
/bin/ls -al
echo -e "\n>>> date"
/bin/date
echo -e "\n>>> env"
/bin/env | sort
echo -e "\n>>> home = $HOME"
ls -al $HOME
echo -e "\n>>> sessiondir = $TMPDIR"
ls -al $TMPDIR
echo -e "\n>>> .bashrc"
cat $HOME/.bashrc
echo -e "\n>>> .bash_profile"
cat $HOME/.bash_profile
echo -e "\n>>> .bash_logout"
cat $HOME/.bash_logout
echo -e "\n>>> hostname"
hostname -f
echo -e "\n>>> mounts"
mount
echo -e "\n>>> quota"
quota
echo -e "\n>>> finished"
if [ "$SLEEP_AT_END_OF_JOB" -eq 1 ]; then
echo "Sleeping for five minutes"
sleep 300
fi
echo "Done"
Links
- Using ssh at Nikhef
- Interactive CPU nodes
- Interactive GPU nodes
- GPU batch jobs
- Overview of Stoomboot
- Storage Overview
Contact
- Email stbc-users@nikhef.nl or stbc-admin@nikhef.nl for questions about submitting batch jobs.
- Chat in Nikhef's Mattermost channel for stbc-users.