Qsub tunnel

Aim: Describe how to access and use the Grid cluster from stoomboot

Target audience: Users of the Stoomboot cluster who temporarily need more compute capacity

Introduction

There is a special functionality that has been created for local Nikhef users to submit to the Nikhef Grid cluster (also known as NDPF, or the Nikhef Data Processing Facility cluster) as if a user were submitting to the Stoomboot cluster. The qsub tunnel allows users access to more compute, networking and storage resources for larger-scale workflows. However, this comes with some additional complexity.

Note, this is an experimental service! There is no service level defined of the qsub tunnel, and it is not guaranteed to be available or to suit any/all purposes. This service is open to selected users. In order to gain access, users must send a request to grid.sysadmin@nikhef.nl and ask their account to be enabled for this service.

Usage

Setting up and using directories

Email grid.sysadmin@nikhef.nl to ask for the creation of a directory for the qsub tunnel service. Your Nikhef username should appear in the /data/tunnel/user/[username] file system that is accessible from the interactive nodes.

Jobs can be submitted to the Grid batch system from the interactive nodes.

The tunnel directory /data/tunnel/user/[username] is for the exchange of log files and stdout/stderr only from your job submissions. Do not use it for data transfer; quota on this file system is intentionally limited to 256 MB per user.

Note: neither your desktop nor Nikhef home directory is available on the Grid cluster. So you will need to make use of the /data/tunnel file system for logs and stdout/stderr, and dCache for storing output of your jobs (more information on this below).

Certificates and proxies

Get a valid proxy with voms-proxy-init -voms [your_experiment]. See the explaination on the Grid jobs page.

Test storage access

With your proxy generated, use the xrootd or WebDAV tools to access the fast bulk storage from your job. For example, by trying

## Set the environment variable for your X509 user proxy file stored in /tmp.
export X509_USER_PROXY=/tmp/x509up_u$(id -u)
## This command needs to be run every time  there's a new session unless it is placed in a ~/.bashrc or ~/.bash_profile file

## List directories in the data folder. This will test if your authentication is working correctly.
davix-ls --cert $X509_USER_PROXY https://dcache.nikhef.nl:2880/dcache/test/

## Test command for copying files using xrootd.
xrdcp ./myfile.dat xroot://dcache.nikhef.nl/dcache/test/

Submitting batch jobs

nsub (/global/ices/toolset/bin/nsub) is the utility you want to use for submitting jobs. It has qsub command plus some extra scripts to make your submission to the Grid cluster easier.

[maryh@stbc-i2 ~]$ /global/ices/toolset/bin/nsub --help
The nsub utility submits batch scripts from ikohefnet hosts to the
NDPF using the shared file system. The jobs can be submitted from selected
hosts on the ikohef network and will run on the NDPF clusters.

Usage: nsub [-o output] [-e error] [-q queue] [-v] [--quiet] [-I]
      [-t njobs] [-N name] [-f proxyfile] [-c rcfile] <job> ...

Arguments:
  -o outputfile            destination of stdout, must be in sharedfs
  -e errorfile             destination of stderr, must be in sharedfs
  -q queue                 queue to which to submit on korf.nikhef.nl
  -v/--quiet               be more verbose/do not warn about relocations
  -N name                  name on the job in the pbs queue
  -I                       interactive jobs are NOT SUPPORTED
  -n                       do not actually submit job (dryrun)
  -c rcfile                additional definitions and defaults
  <job> ...                more than one job may be submitted to PBS,
                           (no -o or -e possible if more than one job given)
Defaults:
  shared file system       /data/tunnel
  userdirectory            /data/tunnel/user/maryh
  pbs server               korf.nikhef.nl
  queue                    medium

Fast local disk space is available on each worker node. Use $TMPDIR for all your data storage and cache needs—and not your home! The size of the TMPDIR is more than 50GB per job slot. Once the job is finished, this directory is purged.

By default, the nsub utility will submit to this medium queue. This queue is called medium and has a maximum wall clock time of 36 hours.

Example job script

A typical job script looks like:

#! /bin/sh
#
## @(#)/user/davidg/ndpf-example.sh
#
#PBS -q medium@korf.nikhef.nl
#
cd $TMPDIR
xrdcp  xroot://dcache.nikhef.nl/dcache/test/maryh/welcome.txt `pwd`/welcome.txt
mv welcome.txt goodbye.txt
xrdcp --force `pwd`/goodbye.txt xroot://dcache.nikhef.nl/dcache/test/goodbye.txt
dt=`date`
nodes=`cat $PBS_NODEFILE`
echo "Copied the welcome to the goodbye file on $dt!"
echo "and used hosts $nodes for that"

which can be submitted to the Grid batch system with a command like:

/global/ices/toolset/bin/nsub /user/davidg/ndpf-example.sh

In the Grid cluster, you will see the tunnel file system and the experiment's software distributions usually available via CernVMFS (for example, check what is available with an ls command on an interactive node: ls /cvmfs/atlas.cern.ch/).

At the end of the job, write the results to dCache storage using xrootd or WebDAV. More information can be found on the dCache storage page. Note that you cannot overwrite files in dCache storage—select unique names or remove your old data with a command like:

## xrdfs host[:port] [command [args]]
xrdfs xroot://dcache.nikhef.nl/dcache/test/maryh/welcome.txt rm

Check xrdfs --help for more options.

Checking the status of your jobs

From interactive nodes, use qstat @korf to query the Grid cluster head node (korf.nikhef.nl).

From login.nikhef.nl, use /opt/torque4/bin/qstat-torque @korf.

Note, this will not return information about other jobs—only the jobs you have submitted and only when you have submitted a job.

Environment variables

The nsub wrapper will override the following environment variables to match the specific needs of the Grid cluster. If you encounter some strange errors, you may want to review your setup to see if you need any of these variables for a specific purpose.

LCG_GFAL_INFOSYS
GLITE_LOCATION
GLOBUS_TCP_PORT_RANGE
GLITE_LOCATION_VAR
GLITE_SD_SERVICES_XML
EDG_LOCATION
EDG_WL_LOCATION
GLITE_SD_PLUGIN
GLITE_WMS_LOCATION
GLITE_LOCAL_CUSTOMIZATION_DIR
X509_CERT_DIR
X509_VOMS_DIR

If you are not using the nsub wrapper, source the setup script via executing

. /global/ices/lcg/current/etc/profile.d/grid-env.sh

at the top of your job script.

Links

Contact

Email grid.sysadmin@nikhef.nl.