Qsub tunnel
Aim: Describe how to access and use the Grid cluster from stoomboot
Target audience: Users of the Stoomboot cluster who temporarily need more compute capacity
Introduction
There is a special functionality that has been created for local Nikhef users to submit to the Nikhef Grid cluster (also known as NDPF, or the Nikhef Data Processing Facility cluster) as if a user were submitting to the Stoomboot cluster. The qsub tunnel allows users access to more compute, networking and storage resources for larger-scale workflows. However, this comes with some additional complexity.
Note, this is an experimental service! There is no service level defined of the qsub tunnel, and it is not guaranteed to be available or to suit any/all purposes. This service is open to selected users. In order to gain access, users must send a request to grid.sysadmin@nikhef.nl and ask their account to be enabled for this service.
Usage
Setting up and using directories
Email grid.sysadmin@nikhef.nl to ask for the creation of a directory for the qsub tunnel service. Your Nikhef username should appear in the /data/tunnel/user/[username] file system that is accessible from the interactive nodes.
Jobs can be submitted to the Grid batch system from the interactive nodes.
The tunnel directory /data/tunnel/user/[username] is for the exchange of log files and stdout/stderr only from your job submissions. Do not use it for data transfer; quota on this file system is intentionally limited to 256 MB per user.
Note: neither your desktop nor Nikhef home directory is available on the Grid cluster. So you will need to make use of the /data/tunnel file system for logs and stdout/stderr, and dCache for storing output of your jobs (more information on this below).
Certificates and proxies
Get a valid proxy with voms-proxy-init -voms [your_experiment]
. See the explaination on the Grid jobs page.
Test storage access
With your proxy generated, use the xrootd or WebDAV tools to access the fast bulk storage from your job. For example, by trying
## Set the environment variable for your X509 user proxy file stored in /tmp.
export X509_USER_PROXY=/tmp/x509up_u$(id -u)
## This command needs to be run every time there's a new session unless it is placed in a ~/.bashrc or ~/.bash_profile file
## List directories in the data folder. This will test if your authentication is working correctly.
davix-ls --cert $X509_USER_PROXY https://dcache.nikhef.nl:2880/dcache/test/
## Test command for copying files using xrootd.
xrdcp ./myfile.dat xroot://dcache.nikhef.nl/dcache/test/
Submitting batch jobs
nsub
(/global/ices/toolset/bin/nsub) is the utility you want to use for submitting jobs. It has qsub command plus some extra scripts to make your submission to the Grid cluster easier.
[maryh@stbc-i2 ~]$ /global/ices/toolset/bin/nsub --help
The nsub utility submits batch scripts from ikohefnet hosts to the
NDPF using the shared file system. The jobs can be submitted from selected
hosts on the ikohef network and will run on the NDPF clusters.
Usage: nsub [-o output] [-e error] [-q queue] [-v] [--quiet] [-I]
[-t njobs] [-N name] [-f proxyfile] [-c rcfile] <job> ...
Arguments:
-o outputfile destination of stdout, must be in sharedfs
-e errorfile destination of stderr, must be in sharedfs
-q queue queue to which to submit on korf.nikhef.nl
-v/--quiet be more verbose/do not warn about relocations
-N name name on the job in the pbs queue
-I interactive jobs are NOT SUPPORTED
-n do not actually submit job (dryrun)
-c rcfile additional definitions and defaults
<job> ... more than one job may be submitted to PBS,
(no -o or -e possible if more than one job given)
Defaults:
shared file system /data/tunnel
userdirectory /data/tunnel/user/maryh
pbs server korf.nikhef.nl
queue medium
Fast local disk space is available on each worker node. Use $TMPDIR
for all your data storage and cache needs—and not your home! The size of the TMPDIR is more than 50GB per job slot. Once the job is finished, this directory is purged.
By default, the nsub
utility will submit to this medium queue. This queue is called medium
and has a maximum wall clock time of 36 hours.
Example job script
A typical job script looks like:
#! /bin/sh
#
## @(#)/user/davidg/ndpf-example.sh
#
#PBS -q medium@korf.nikhef.nl
#
cd $TMPDIR
xrdcp xroot://dcache.nikhef.nl/dcache/test/maryh/welcome.txt `pwd`/welcome.txt
mv welcome.txt goodbye.txt
xrdcp --force `pwd`/goodbye.txt xroot://dcache.nikhef.nl/dcache/test/goodbye.txt
dt=`date`
nodes=`cat $PBS_NODEFILE`
echo "Copied the welcome to the goodbye file on $dt!"
echo "and used hosts $nodes for that"
which can be submitted to the Grid batch system with a command like:
In the Grid cluster, you will see the tunnel file system and the experiment's software distributions usually available via CernVMFS (for example, check what is available with an ls
command on an interactive node: ls /cvmfs/atlas.cern.ch/
).
At the end of the job, write the results to dCache storage using xrootd or WebDAV. More information can be found on the dCache storage page. Note that you cannot overwrite files in dCache storage—select unique names or remove your old data with a command like:
## xrdfs host[:port] [command [args]]
xrdfs xroot://dcache.nikhef.nl/dcache/test/maryh/welcome.txt rm
Check xrdfs --help
for more options.
Checking the status of your jobs
From interactive nodes, use qstat @korf
to query the Grid cluster head node (korf.nikhef.nl).
From login.nikhef.nl, use /opt/torque4/bin/qstat-torque @korf
.
Note, this will not return information about other jobs—only the jobs you have submitted and only when you have submitted a job.
Environment variables
The nsub wrapper will override the following environment variables to match the specific needs of the Grid cluster. If you encounter some strange errors, you may want to review your setup to see if you need any of these variables for a specific purpose.
LCG_GFAL_INFOSYS
GLITE_LOCATION
GLOBUS_TCP_PORT_RANGE
GLITE_LOCATION_VAR
GLITE_SD_SERVICES_XML
EDG_LOCATION
EDG_WL_LOCATION
GLITE_SD_PLUGIN
GLITE_WMS_LOCATION
GLITE_LOCAL_CUSTOMIZATION_DIR
X509_CERT_DIR
X509_VOMS_DIR
If you are not using the nsub wrapper, source the setup script via executing
at the top of your job script.
Links
Contact
- Email grid.sysadmin@nikhef.nl.