Running Jobs on Katana¶
Brief Overview¶
The Login Node of a cluster is a shared resource for all users and is used for preparing, submitting and managing jobs.
Note
Never run any computationally intensive processes on the login nodes.
Jobs are submitted from the login node, which delivers them to the Head Node for job and resource management. Once the resources have been allocated and are available, the job will run on one or more of the compute nodes as requested.
Different clusters use different tools to manage resources and schedule jobs - OpenPBS and SLURM are two popular systems. Katana, like NCI’s Gadi, uses OpenPBS for this purpose.
Jobs are submitted using the qsub command. There are two types of job that qsub will accept: an Interactive Job and a Batch Job. Regardless of type, the resource manager will put your job in a Queue.
An interactive job provides a shell session on a Compute Node. You interact directly with the compute node, explicitly running the software you need. Interactive jobs are useful for experimentation, debugging, and planning for batch jobs.
Note
For calculations that run longer than a few hours, batch jobs are preferred.
In contrast, a Batch Job is a scripted job that, after submission via qsub, runs from start to finish without any user intervention. The vast majority of jobs on the cluster are batch jobs. This type of job is appropriate for production runs that will consume several hours or days.
To submit a Batch Job you will need to create a job script which specifies the resources that your job requires and calls your program. The general structure of A Job Script is shown below.
Important
All jobs go into a Queue while waiting for resources to become available. The length of time your jobs wait in a queue for resources depends on a number of factors.
The main resources available for use are Memory (RAM), CPU Core (number of CPUs) and Walltime (how long you want the CPUs for). These need to be considered carefully when writing your job script, since the decisions you make will affect which queue your job ends up in.
As you request more memory, CPU cores, or walltime, the number of available queues goes down. The limits at which the number of queues decreases are summarised in the table below.
Job queue limits summary¶
Typical job queue limit cut-offs are shown below. The walltime determines whether a job can run on any node or only on a restricted set of nodes.

Resource | Queue limit cut-offs | | | | | |
---|---|---|---|---|---|---|
Memory (GB) | 124 | 180 | 248 | 370 | 750 | 1000 |
CPU Cores | 16 | 20 | 24 | 28 | 32 | 44 |
Walltime (hrs) | 12 | 48 | 100 | 200 | | |

Node access by walltime, from shortest to longest: any node; school-owned or general-use nodes; school-owned nodes only.
Note
Try to combine or divide batch jobs to fit within that 12 hour limit for fastest starting times.
The resources available on a specific compute node can be shown with the pbsnodes command.
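One way to read the cut-offs is that a request is bounded by the smallest cut-off that is at least as large as the request. The sketch below assumes that reading and hard-codes the memory row of the table. `memory_tier` is a hypothetical helper for illustration, not a PBS or Katana command.

```shell
#!/bin/bash
# Memory cut-offs (GB) copied from the table above.
MEM_CUTOFFS="124 180 248 370 750 1000"

# Print the smallest cut-off that can hold a request of $1 GB.
# Assumption: a request is bounded by the first cut-off >= its size.
memory_tier() {
    local request=$1 cutoff
    for cutoff in $MEM_CUTOFFS; do
        if [ "$request" -le "$cutoff" ]; then
            echo "$cutoff"
            return 0
        fi
    done
    echo "no cut-off holds ${request}GB" >&2
    return 1
}

memory_tier 150   # prints 180
```

Under this reading, trimming a 150GB request to 124GB or less would move it below the first cut-off.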
Interactive Jobs¶
An interactive job or interactive session is a session on a compute node with the required physical resources for the period of time requested. To request an interactive job, add the -I flag (capital i) to qsub. Default sessions have 1 CPU core, 1GB of memory and a walltime of 1 hour.
The following two commands illustrate this. The first provides a default session; the second provides a session with two CPU cores and 8GB of memory for three hours. You can tell that an interactive job has started when the server name in the prompt changes from katana1 or katana2 to the name of the server your job is running on. In these cases it’s k181 and k201 respectively.
[z1234567@katana1 ~]$ qsub -I
qsub: waiting for job 313704.kman.restech.unsw.edu.au to start
qsub: job 313704.kman.restech.unsw.edu.au ready
[z1234567@k181 ~]$
[z1234567@katana2 ~]$ qsub -I -l select=1:ncpus=2:mem=8gb,walltime=3:00:00
qsub: waiting for job 1234.kman.restech.unsw.edu.au to start
qsub: job 1234.kman.restech.unsw.edu.au ready
[z1234567@k201 ~]$
Jobs are constrained by the resources that are requested. In the previous example the first job, running on k181, would be terminated after 1 hour or if a command within the session consumed more than 1GB of memory; the second, running on k201, would be terminated after 3 hours or if it consumed more than 8GB. The job (and therefore the session) can also be terminated by the user with CTRL-D or the logout command.
Interactive jobs can be particularly useful while developing and testing code for a future batch job, or performing an interactive analysis that requires significant compute resources. Never attempt such tasks on the login node – submit an interactive job instead.
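If you request interactive sessions often, a small shell function can save typing. The sketch below is a dry run: it only prints the qsub command so you can check it before running it. `interactive_cmd` is a hypothetical name, and its defaults simply mirror the default session described above.

```shell
#!/bin/bash
# Dry run: print the qsub command for an interactive session.
# Defaults mirror the documented default session (1 core, 1GB, 1 hour).
interactive_cmd() {  # interactive_cmd [NCPUS] [MEM_GB] [HOURS]
    printf 'qsub -I -l select=1:ncpus=%d:mem=%dgb,walltime=%d:00:00\n' \
        "${1:-1}" "${2:-1}" "${3:-1}"
}

interactive_cmd 2 8 3
```

Called with `2 8 3`, this prints the same command as the second example above.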
Batch Jobs¶
A batch job is a script that runs autonomously on a compute node. The script must contain the necessary sequence of commands to complete a task independently of any input from the user. This section contains information about how to create and submit a batch job on Katana.
Getting Started¶
The following script simply executes a pre-compiled program (“myprogram”) in the user’s home directory:
#!/bin/bash
cd $HOME
./myprogram
This script can be submitted to the cluster with qsub, at which point it becomes a job and is assigned to a queue. If the script is in a file called myjob.pbs then the following command will submit the job with the default resource requirements (1 CPU core with 1GB of memory for 1 hour):
[z1234567@katana1 ~]$ qsub myjob.pbs
1237.kman.restech.unsw.edu.au
As with interactive jobs, the -l (lowercase L) flag can be used to specify resource requirements for the job:
[z1234567@katana ~]$ qsub -l select=1:ncpus=1:mem=4gb,walltime=12:00:00 myjob.pbs
1238.kman.restech.unsw.edu.au
To use GPU resources, add ngpus to the request. Because of the way the GPU machines are configured, you should request 8 CPU cores and 46GB of memory per GPU, that is ncpus=(ngpus*8) and mem=(ngpus*46)gb:
[z1234567@katana ~]$ qsub -l select=1:ncpus=8:ngpus=1:mem=46gb,walltime=12:00:00 myjob.pbs
1238.kman.restech.unsw.edu.au
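The per-GPU ratios can be worked out with shell arithmetic. This is a minimal sketch; `gpu_select` is a hypothetical helper, and it only prints the select statement for you to paste after `qsub -l`.

```shell
#!/bin/bash
# Build the select statement for a GPU job using the ratios quoted above:
# 8 CPU cores and 46GB of memory per GPU.
gpu_select() {  # gpu_select NGPUS
    local ngpus=$1
    echo "select=1:ncpus=$((ngpus * 8)):ngpus=${ngpus}:mem=$((ngpus * 46))gb"
}

gpu_select 2   # prints select=1:ncpus=16:ngpus=2:mem=92gb
```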
A Job Script¶
Job scripts offer a much more convenient method for invoking any of the options that can be passed to qsub on the command-line. In a shell script, a line starting with # is a comment and will be ignored by the shell interpreter. However, in a job script, a line starting with #PBS can be used to pass options to the qsub command.
Here is an overview of the different parts of a job script which we will examine further below. In the following sections we will add some code, explain what it does, then show some new code, and iterate up to something quite powerful.
For the previous example, the job script could be rewritten as:
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=12:00:00
cd $HOME
./myprogram
Warning
This structure is the most common that you will use. The top line must be #!/bin/bash because we are running bash scripts.
The following section, the lines starting with #PBS, is where we configure how the job will be run. Here we are asking for resources.
The final section shows the commands that will be executed in the configured session.
The script can now be submitted with much less typing:
[z1234567@katana ~]$ qsub myjob.pbs
1239.kman.restech.unsw.edu.au
Unlike submission of an interactive job, which results in a session on a compute node ready to accept commands, the submission of a batch job returns the ID of the new job. This is confirmation that the job was submitted successfully. The job is now processed by the job scheduler and resource manager. Commands for checking the status of the job can be found in the section Managing Jobs on Katana.
If you wish to be notified by email when the job finishes then use the -M flag to specify the email address and the -m flag to declare which events cause a notification. Here we will get an email if the job aborts due to an error (-m a) or ends naturally (-m e).
#PBS -M your.name.here@unsw.edu.au
#PBS -m ae
The output that would normally go to the screen, along with any error messages from a batch job, is saved to file when your job ends. By default these files are called JOB_NAME.oJOB_ID and JOB_NAME.eJOB_ID, and they appear in the directory that was the current working directory when the job was submitted. Because the job name defaults to the name of the script, in the above example they would be myjob.pbs.o1239 and myjob.pbs.e1239. You can merge these into a single file with the -j oe flag. The -o flag lets you choose the name and location of the output file.
#PBS -j oe
#PBS -o /home/z1234567/results/Output_Report
When a job starts, it needs to know where to save its output and do its work. This is called the current working directory. By default the job scheduler makes your home directory (/home/z1234567) the current working directory. This is unlikely to be what you want, so it is important that each job sets its current working directory appropriately. There are a couple of ways to do this; the easiest is to change to the directory you were in when you ran qsub by using:
cd $PBS_O_WORKDIR
There is one last special variable you should know about, especially if you are working with large datasets. The storage on the compute node your job is running on will always be faster than the network drive.
If you use the storage close to the CPUs (in the server itself rather than on the shared drives), called Local Scratch, you can often save hours of time otherwise spent reading and writing across the network.
To do this, copy data to and from the local scratch directory, whose path is held in $TMPDIR:
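# Copy the archive to local scratch and unpack it there
cp /home/z1234567/project/massivedata.tar.gz $TMPDIR
cd $TMPDIR
tar xvf massivedata.tar.gz
# Run the analysis, then copy the results back to shared storage
my_analysis.py massive_data
cp -r $TMPDIR/my_output /home/z1234567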
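The copy-in and copy-out steps can be wrapped in small functions so that each job script stages data the same way. This is a sketch under assumptions: `stage_in` and `stage_out` are hypothetical helper names, and the fallback to `mktemp -d` exists only so the functions can be tried outside a job, where $TMPDIR may not be set.

```shell
#!/bin/bash
# Sketch of staging data through local scratch.
# stage_in and stage_out are hypothetical helpers, not PBS commands.

# Inside a job, $TMPDIR points at local scratch; outside one, fall back to a
# fresh temporary directory so the helpers can be tried on their own.
SCRATCH="${TMPDIR:-$(mktemp -d)}"

stage_in() {   # stage_in FILE: copy an input file to local scratch
    cp "$1" "$SCRATCH"/
}

stage_out() {  # stage_out NAME DEST_DIR: copy a result back to shared storage
    cp -r "$SCRATCH/$1" "$2"/
}
```

In a job script you might call `stage_in` before the analysis and `stage_out` after it, keeping all heavy reading and writing on the local disk.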
There are a lot of things that can be done with PBS, but you don’t need to know it all. These few basics will get you started.
Here’s the full script as we’ve described it. Copy it into a text editor and replace our dummy values with your own; after that, you only need to change the last line for each new job.
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=12:00:00
#PBS -M your.name.here@unsw.edu.au
#PBS -m ae
#PBS -j oe
#PBS -o /home/z1234567/results/Output_Report
cd $PBS_O_WORKDIR
./myprogram
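If you write many similar job scripts, you could generate them from a template instead of editing by hand. The function below is only a sketch: `make_job_script` is a hypothetical helper, it covers just the resource, output-merging and working-directory lines, and it prints the script to standard output.

```shell
#!/bin/bash
# Print a job script built from the directives described above.
# make_job_script is a hypothetical helper; adjust the values to suit.
make_job_script() {  # make_job_script NCPUS MEM_GB WALLTIME COMMAND
    cat <<EOF
#!/bin/bash
#PBS -l select=1:ncpus=$1:mem=${2}gb
#PBS -l walltime=$3
#PBS -j oe
cd \$PBS_O_WORKDIR
$4
EOF
}

make_job_script 1 4 12:00:00 ./myprogram
```

Redirecting the output to a file, for example `make_job_script 1 4 12:00:00 ./myprogram > myjob.pbs`, gives a script that can be passed to qsub.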
Array Jobs¶
One common use of computational clusters is to do the same thing multiple times - sometimes with slightly different input, sometimes to get averages from randomness within the process. This is made easier with array jobs.
An array job is a single job script that spawns many almost identical sub-jobs. The only difference between the sub-jobs is an environment variable, $PBS_ARRAY_INDEX, whose value uniquely identifies an individual sub-job. A regular job becomes an array job when it uses the #PBS -J flag.
For example, the following script will spawn 100 sub-jobs. Each sub-job will require one CPU core, 1GB memory and 1 hour run-time, and it will execute the same application. However, a different input file will be passed to the application within each sub-job. The first sub-job will read input data from a file called 1.dat, the second sub-job will read input data from a file called 2.dat and so on.
Note
In this example we wrap the bash variable names in braces ({}). Braces mark exactly where a variable name ends, which matters when the variable is followed directly by other text. They aren’t strictly necessary for $PBS_O_WORKDIR but we include them for consistency.
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -l walltime=1:00:00
#PBS -j oe
#PBS -J 1-100
cd ${PBS_O_WORKDIR}
./myprogram ${PBS_ARRAY_INDEX}.dat
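When your input files are not numbered, a common pattern is to list them in a text file, one per line, and use the array index to pick a line. The sketch below assumes such a list exists; `nth_line` and `inputs.txt` are hypothetical names.

```shell
#!/bin/bash
# Print line N of a file; used to map an array index to an input name.
nth_line() {  # nth_line N FILE
    sed -n "${1}p" "$2"
}

# Inside an array job this might read (inputs.txt is a hypothetical file list):
#   input=$(nth_line "${PBS_ARRAY_INDEX}" inputs.txt)
#   ./myprogram "${input}"
```

With this approach the range given to `#PBS -J` should match the number of lines in the list.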
There are some more examples of array jobs including how to group your computations in an array job on the UNSW Github HPC examples page.
Splitting large Batch Jobs¶
If your batch job can be split into multiple steps you may want to split one big job up into a number of smaller jobs. There are a number of reasons to spend the time to implement this.
If your large job runs for over 200 hours, it won’t finish on Katana.
Your job may have multiple steps which use different amounts of resources. If you have a pipeline that takes 50 hours to run and needs 200GB of memory for an hour, but only 50GB the rest of the time, then 150GB of the requested memory sits idle for the other 49 hours.
Katana has prioritisations based on how many resources any one user uses. If you ask for 200GB of memory, this will be accounted for when working out your next job’s priority.
Because there are many more resources for 12 hour jobs, seven or eight 12 hour jobs will often finish well before a single 100 hour job even starts.
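PBS lets one job wait for another through the `-W depend` option of qsub, which is one way to chain the smaller jobs together. The sketch below is a dry run that only prints the commands; `print_chain` is a hypothetical helper and `step1.pbs`, `step2.pbs` are hypothetical script names. Check how dependencies behave on Katana before relying on this.

```shell
#!/bin/bash
# Dry-run sketch: print the qsub commands for a chain of dependent steps.
# Each step starts only if the previous one finished successfully (afterok).
# Nothing is submitted here.
print_chain() {  # print_chain SCRIPT...
    local prev="" script n=0
    for script in "$@"; do
        n=$((n + 1))
        if [ -z "$prev" ]; then
            echo "JOB$n=\$(qsub $script)"
        else
            echo "JOB$n=\$(qsub -W depend=afterok:\$$prev $script)"
        fi
        prev="JOB$n"
    done
}

print_chain step1.pbs step2.pbs
```

Each printed line captures the job ID that qsub returns, so the next step can name it in its dependency.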
Get information about the state of the scheduler¶
When deciding which jobs to run, the scheduler takes the following details into account:
are there available resources
how recently has this user run jobs successfully
how many resources has this user used recently
how long is the job’s Walltime
how long has the job been in the queue
You can get an overview of the compute nodes and a list of all the jobs running on each node using pstat:
[z1234567@katana2 src]$ pstat
k001 normal-mrcbio free 12/44 200/1007gb 314911*12
k002 normal-mrcbio free 40/44 56/ 377gb 314954*40
k003 normal-mrcbio free 40/44 375/ 377gb 314081*40
k004 normal-mrcbio free 40/44 62/ 377gb 314471*40
k005 normal-ccrc free 0/32 0/ 187gb
k006 normal-physics job-busy 32/32 180/ 187gb 282533*32
k007 normal-physics job-busy 32/32 180/ 187gb 284666*32
k008 normal-physics free 0/32 0/ 187gb
k009 normal-physics job-busy 32/32 124/ 187gb 314652*32
k010 normal-physics free 0/32 0/ 187gb
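Output like this is easy to filter with awk. The sketch below lists the free nodes along with their unused cores. It assumes the column layout shown above (node name, queue, state, then used/total cores), and `free_cores` is a hypothetical helper name.

```shell
#!/bin/bash
# List free nodes and their unused cores from pstat-style output on stdin.
# Assumes the column layout shown above: name, queue, state, used/total cores.
free_cores() {
    awk '$3 == "free" { split($4, c, "/"); print $1, c[2] - c[1] }'
}

# On Katana this would be:  pstat | free_cores
```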
To get information about a particular node, you can use pbsnodes, but on its own it is a firehose. Using it with a particular node name is more effective:
[z1234567@katana2 src]$ pbsnodes k254
k254
Mom = k254
ntype = PBS
state = job-busy
pcpus = 32
jobs = 313284.kman.restech.unsw.edu.au/0, 313284.kman.restech.unsw.edu.au/1, 313284.kman.restech.unsw.edu.au/2
resources_available.arch = linux
resources_available.cpuflags = avx,avx2,avx512bw,avx512cd,avx512dq,avx512f,avx512vl
resources_available.cputype = skylake-avx512
resources_available.host = k254
resources_available.mem = 196396032kb
resources_available.ncpus = 32
resources_available.node_weight = 1
resources_available.normal-all = Yes
resources_available.normal-qmchda = Yes
resources_available.normal-qmchda-maths_business-maths = Yes
resources_available.normal-qmchda-maths_business-maths-general = Yes
resources_available.vmem = 198426624kb
resources_available.vnode = k254
resources_available.vntype = compute
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 50331648kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 32
resources_assigned.ngpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Thu Apr 30 08:06:23 2020
last_used_time = Thu Apr 30 07:08:25 2020
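To pull a single value out of this listing, you can split each line on " = ". This is a sketch that assumes the `name = value` layout shown above; `node_field` is a hypothetical helper name.

```shell
#!/bin/bash
# Print the value of one field from pbsnodes-style output on stdin.
# Assumes each attribute is printed as "name = value", possibly indented.
node_field() {  # node_field NAME
    awk -F' = ' -v key="$1" '
        { sub(/^[ \t]+/, "", $1) }   # strip leading indentation from the name
        $1 == key { print $2 }
    '
}

# On Katana this would be:  pbsnodes k254 | node_field resources_available.mem
```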
Managing Jobs on Katana¶
Once you have jobs running, you will want visibility of the system so that you can manage them - delete jobs, change jobs, check that jobs are still running.
There are a couple of easy to use commands that help with this process.
qstat¶
Show all jobs on the system¶
qstat on its own gives very long output. Consider piping it to less:
[z1234567@katana2 ~]$ qstat | less
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
245821.kman s-m20-i20-200h z1234567 0 Q medicine200
280163.kman Magcomp25A2 z1234567 3876:18: R mech700
282533.kman Proj_MF_Nu1 z1234567 3280:08: R cosmo200
284666.kman Proj_BR_Nu1 z1234567 3279:27: R cosmo200
308559.kman JASASec55 z1234567 191:21:3 R maths200
309615.kman 2020-04-06.BUSC z1234567 185:00:5 R babs200
310623.kman Miaocyclegan z1234567 188:06:3 R simigpu200
...
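Rather than paging through the whole list, you can summarise it. The sketch below counts jobs in a given state, assuming the default layout shown above (two header lines, with the state in the fifth column) and job names without spaces. `count_state` is a hypothetical helper name.

```shell
#!/bin/bash
# Count jobs in a given state from default qstat output on stdin.
# Assumes the layout shown above: two header lines, state in column 5.
count_state() {  # count_state STATE
    awk -v s="$1" 'NR > 2 && $5 == s { n++ } END { print n + 0 }'
}

# On Katana this would be:  qstat | count_state R
```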
List just my jobs¶
You can use either your zID or the environment variable $USER:
[z1234567@katana2 src]$ qstat -u $USER
kman.restech.unsw.edu.au:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
315230.kman.res z1234567 general1 job.pbs -- 1 1 1gb 01:00 Q --
If you add the -s flag, you will get slightly more status information.
[z1234567@katana2 src]$ qstat -su z1234567
kman.restech.unsw.edu.au:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
315230.kman.res z1234567 general1 job.pbs 61915 1 1 1gb 01:00 R 00:03
Job run at Fri May 01 at 14:28 on (k019:mem=1048576kb:ncpus=1:ngpus=0)
315233.kman.res z1234567 general1 job.pbs -- 1 1 1gb 01:00 Q --
--
List information about a particular job¶
[z1234567@katana2 src]$ qstat -f 315236
Job Id: 315236.kman.restech.unsw.edu.au
Job_Name = job.pbs
Job_Owner = z1234567@katana2
job_state = Q
queue = general12
server = kman.gen
Checkpoint = u
ctime = Fri May 1 14:41:00 2020
Error_Path = katana2:/home/z1234567/src/job.pbs.e315236
group_list = GENERAL
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Fri May 1 14:41:00 2020
Output_Path = katana2:/home/z1234567/src/job.pbs.o315236
Priority = 0
qtime = Fri May 1 14:41:00 2020
Rerunable = True
Resource_List.ib = no
Resource_List.mem = 1gb
Resource_List.ncpus = 1
Resource_List.ngpus = 0
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.select = 1:mem=1gb:ncpus=1
Resource_List.walltime = 01:00:00
substate = 10
Variable_List = PBS_O_HOME=/home/z1234567,PBS_O_LANG=en_AU.UTF-8,
PBS_O_LOGNAME=z1234567,
PBS_O_PATH=/home/z1234567/bin:/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:
/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin,PBS_O_M
AIL=/var/spool/mail/z1234567,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/home
/z1234567/src,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=submission,PBS_O_HOST=kat
ana2
etime = Fri May 1 14:41:00 2020
eligible_time = 00:00:00
Submit_arguments = -W group_list=GENERAL -N job.pbs job.pbs.JAZDNgL
project = _pbs_project_default
qdel¶
Remove a job from the queue, or kill it if it has already started. qdel takes the job ID as its argument. To remove an array job, you must include the square brackets, and they need to be escaped: qdel 12345\[\]
[z1234567@katana2 src]$ qdel 315252
qalter¶
Once a job has been submitted, it can be altered. However, once a job begins execution, the only values that can be modified are cputime, walltime, and run_count. These can only be reduced.
Users can only lower resource requests on queued jobs. If you need to increase resources, contact a systems administrator. In this example you will see the resources change, but not the Submit_arguments:
[z1234567@katana2 src]$ qsub -l select=1:ncpus=2:mem=128mb job.pbs
315259.kman.restech.unsw.edu.au
[z1234567@katana2 src]$ qstat -f 315259
Job Id: 315259.kman.restech.unsw.edu.au
...
Resource_List.mem = 128mb
Resource_List.ncpus = 2
...
Submit_arguments = -W group_list=GENERAL -N job.pbs -l select=1:ncpus=2:mem=128mb job.pbs.YOOu3lB
project = _pbs_project_default
[z1234567@katana2 src]$ qalter -l select=1:ncpus=4:mem=512mb 315259; qstat -f 315259
Job Id: 315259.kman.restech.unsw.edu.au
...
Resource_List.mem = 512mb
Resource_List.ncpus = 4
...
Submit_arguments = -W group_list=GENERAL -N job.pbs -l select=1:ncpus=2:mem=128mb job.pbs.YOOu3lB
project = _pbs_project_default
Tips for using PBS and Katana effectively¶
Keep your jobs under 12 hours if possible¶
If you request more than 12 hours of walltime then you can only use the nodes bought by your school or research group. Keeping your job’s run time request under 12 hours means that it can run on any node in the cluster.
Important
Two 10 hour jobs will probably finish sooner than one 20 hour job.
In fact, if there is spare capacity on Katana, which there is most of the time, six 10 hour jobs will finish before a single 20 hour job will.
Requesting more resources for your job decreases the places that the job can run¶
The most obvious example is going over the 12 hour limit, which reduces the number of compute nodes that your job can run on. Similarly, specifying the CPU type in your job script restricts you to the nodes with that CPU. Memory works the same way: a job that requests 20GB will run on a 128GB node that already has a 100GB job running, but a 30GB job will not.
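The memory example comes down to simple arithmetic: a job fits if its request is no larger than the node's total memory minus what is already in use. The sketch below checks this. `fits` is a hypothetical helper, and it ignores any memory the node reserves for its own use.

```shell
#!/bin/bash
# Return success if a memory request fits in what is left on a node.
fits() {  # fits NODE_TOTAL_GB USED_GB REQUEST_GB
    [ "$3" -le $(( $1 - $2 )) ]
}

# 128GB node with 100GB in use leaves 28GB.
fits 128 100 20 && echo "20GB job fits"
fits 128 100 30 || echo "30GB job does not fit"
```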
Running your jobs interactively makes it hard to manage multiple concurrent jobs¶
If you are currently only running jobs interactively then you should move to batch jobs, which allow you to submit more jobs that then start, run and finish automatically. If you have multiple batch jobs that are almost identical, for example differing only in a file name or another variable, then you should consider using array jobs.