How should my script look?
A job script consists of PBS directives (lines beginning with #PBS), comments, and executable statements. A directive provides a way of specifying job attributes in addition to the command line options. The attributes we use most often are walltime, nodes, and number of processors. There are many others; see the man page for pbs_resources_linux for further information.
Examples of how to submit a job script:
Using command line options:
qsub -l nodes=15,walltime=2:00:00 script
or
Script file using directives:
#PBS -l nodes=15,walltime=2:00:00
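As a rough sketch of how the two forms relate, the lines below write a small job script to a temporary file and then list the #PBS directives found in it. The helper name list_directives and the extra -N directive are assumptions made for illustration, and nothing here contacts the batch server.

```shell
#!/bin/sh
# Write a hypothetical job script to a temporary file.
jobfile=$(mktemp)
cat > "$jobfile" <<'EOF'
#!/bin/sh
# Resource request, same values as the qsub -l example above
#PBS -l nodes=15,walltime=2:00:00
#PBS -N sample
cd $HOME
./hello-world
EOF

# Print only the directive lines, i.e. the lines starting with "#PBS".
# (list_directives is a helper made up for this sketch.)
list_directives() {
    grep '^#PBS' "$1"
}

list_directives "$jobfile"
rm -f "$jobfile"
```

Options given on the qsub command line take precedence over the same directive in the script, so a script can carry defaults that are overridden at submission time.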
How to submit a job
I. Using the command qsub users can submit a script file.
II. Example of a script file:
vi hello.sh
#!/bin/sh
# This is a simple example of a Torque script
#PBS -N sample
cd $HOME/
./hello-world
III. To submit this script type
[node0000]$ qsub hello.sh
The job id is printed on the screen in the form number.hostname (e.g. 0.hostname).
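If you want to reuse the job id later in a script, it can be captured from the qsub output. The sketch below only shows the string handling; job_number is a helper name made up here, and the id is hard-coded so the lines run without a batch server.

```shell
#!/bin/sh
# Strip the ".hostname" part from a job id such as "0.hostname".
# (job_number is a hypothetical helper, not a Torque command.)
job_number() {
    printf '%s\n' "${1%%.*}"
}

# On a cluster the id would come from qsub itself, for example:
#   jobid=$(qsub hello.sh)
# Here the id from the example above is used instead.
jobid="0.hostname"
job_number "$jobid"    # prints 0
```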
How to check the status of my job
I. Using the command qstat users can check the status of their job. Look for the job id number.
[node0000]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
0.hostname hello.sh jackie 00:00:00 R batch
Note: See the man pages for further options to qstat
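To pick out the state of a single job from the qstat listing, something like the following may help. It is a sketch that assumes the default column layout shown above, and job_state is a helper name invented for this example.

```shell
#!/bin/sh
# Print the S (state) column for one job number, reading qstat output on stdin.
# Assumes the job id is in column 1 and the state in column 5, as shown above.
job_state() {
    awk -v id="$1" 'index($1, id ".") == 1 { print $5 }'
}

# Sample output copied from above; on a cluster use:  qstat | job_state 0
job_state 0 <<'EOF'
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
0.hostname hello.sh jackie 00:00:00 R batch
EOF
```

Common state letters are Q (queued), R (running), H (held), E (exiting) and C (completed).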
II. Using the command tracejob users can check the status of their job.
[node0000]$ tracejob 0
Job: 0.hostname
03/02/2007 14:37:26 S Job Queued at request of jackie@hostname, owner = jackie@hostname, job name = hello.sh, queue = batch
03/02/2007 14:37:26 A queue=batch
03/02/2007 14:37:27 S Job Modified at request of Scheduler@hostname
03/02/2007 14:37:27 L Job Run
03/02/2007 14:37:27 S Job Run at request of Scheduler@hostname
03/02/2007 14:37:27 A user=jackie group=staff jobname=hello.sh queue=batch ctime=1172875046 qtime=1172875046 etime=1172875046 start=1172875047 exec_host=node0172/0+node0001/0+node0000/0 Resource_List.nodect=3
03/02/2007 14:37:51 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=708kb resources_used.vmem=11036kb resources_used.walltime=00:00:22
03/02/2007 14:37:51 A user=jackie group=staff jobname=hello.sh queue=batch ctime=1172875046 qtime=1172875046 etime=1172875046 start=1172875047 exec_host=node0000/0 session=0 end=1172875071 Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=708kb resources_used.vmem=11036kb resources_used.walltime=00:00:22
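The most useful single value in the trace is usually Exit_status. A possible way to pull it out is sketched below; exit_status is a helper name made up for this example, and the input is two lines from the trace above rather than live tracejob output.

```shell
#!/bin/sh
# Print the last Exit_status value found in tracejob output on stdin.
# (exit_status is a hypothetical helper, not part of Torque.)
exit_status() {
    sed -n 's/.*Exit_status=\(-\{0,1\}[0-9][0-9]*\).*/\1/p' | tail -n 1
}

# On a cluster use:  tracejob 0 | exit_status
exit_status <<'EOF'
03/02/2007 14:37:27 S Job Run at request of Scheduler@hostname
03/02/2007 14:37:51 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=708kb resources_used.vmem=11036kb
EOF
```

An Exit_status of 0 means the job script finished normally.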
How to script for a parallel run:
vi mpi.sh
#!/bin/bash
#PBS -l nodes=16:ppn=2
cd /home/jackie/QSUB
## Example 1: "This is an example for MPI using mpi under topspin"
/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -hostfile $PBS_NODEFILE -np 32 ./mpi-hello
## Example 2: "This is an example of MPI running openmpi located in /usr/bin"
/usr/bin/mpirun --hostfile $PBS_NODEFILE -np 32 ./mpi-hello
## Example 3: "This is an example of LAM/MPI using lamboot located in /usr/bin"
/usr/bin/lamboot
/usr/bin/mpirun -hostfile $PBS_NODEFILE -np 32 ./mpi-hello
/usr/bin/lamhalt
## Example 4: "This is an example of using open-mpi using modules"
. /usr/Modules/init/bash
module load open-mpi/1.2.6-gcc
mpirun --hostfile $PBS_NODEFILE -np 32 ./mpi-hello
Now save the file, keeping only the example that matches the MPI installation you use.
Submit it with qsub, just like the previous job. You might have to specify which queue to submit your job to; read on to see which queues are available.
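The examples hard-code -np 32, which matches nodes=16:ppn=2. Since Torque writes one line per allocated processor to the file named by $PBS_NODEFILE, the count can instead be derived from that file. The following is a sketch only: count_slots is a helper name made up here, and outside a job it builds a small stand-in node file so the lines still run.

```shell
#!/bin/sh
# Count the processor slots listed in a node file (one line per slot).
# (count_slots is a hypothetical helper.)
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Inside a job, $PBS_NODEFILE is set by the batch system. Outside a job this
# sketch creates a stand-in file with four entries.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf '%s\n' node0000 node0000 node0001 node0001 > "$PBS_NODEFILE"
fi

NP=$(count_slots "$PBS_NODEFILE")
echo "would run: mpirun --hostfile $PBS_NODEFILE -np $NP ./mpi-hello"
```

With this approach the #PBS -l line is the only place where the job size has to be changed.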
How to check the queues that are available:
[node0000]$ qstat -Q
Queue Max Tot Ena Str Que Run Hld Wat Trn Ext Type
---------------- --- --- --- --- --- --- --- --- --- --- ----------
batch 0 0 yes yes 0 0 0 0 0 0 Execution
regular 0 0 yes yes 0 0 0 0 0 0 Execution
NOTE: If more than one Execution queue is listed, you must specify the queue you want to submit your job to. Otherwise the server uses its default queue, batch, and all jobs go to the batch queue.
[node0000]$ qsub -q regular script
In this example the job script is sent to the regular queue. Substitute batch for regular and the job goes to the batch queue instead. Each queue is configured with its own limits.
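If a script needs the list of queue names, they can be read from the first column of the qstat -Q listing. This is a sketch under the assumption that the output starts with two header lines as shown above; list_queues is a helper name invented for this example.

```shell
#!/bin/sh
# Print the queue names from "qstat -Q" output on stdin.
# Skips the two header lines and any blank lines.
list_queues() {
    awk 'NR > 2 && NF > 0 { print $1 }'
}

# Sample output copied from above; on a cluster use:  qstat -Q | list_queues
list_queues <<'EOF'
Queue Max Tot Ena Str Que Run Hld Wat Trn Ext Type
---------------- --- --- --- --- --- --- --- --- --- --- ----------
batch 0 0 yes yes 0 0 0 0 0 0 Execution
regular 0 0 yes yes 0 0 0 0 0 0 Execution
EOF
```

The queue can also be chosen inside the job script with a "#PBS -q regular" line instead of the -q option on the command line.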
How to check what limits a queue has:
[node0000]$ qstat -q
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
batch -- -- -- 4 3 0 -- E R
regular -- -- 48:00:00 -- 1 20 -- E R
In this example the batch queue has a limit of 4 nodes and the regular queue has a 48 hour walltime limit. Jobs submitted to the regular queue must complete within 48 hours or they will be killed. Jobs submitted to the batch queue must request no more than 4 nodes or they will be rejected.
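Before submitting, a requested walltime can be compared against the queue limit by converting both to seconds. The sketch below uses two helper names made up for this example (to_seconds and fits_limit) and a hypothetical request of 50 hours.

```shell
#!/bin/sh
# Convert an HH:MM:SS walltime to seconds. awk is used so that fields with a
# leading zero such as "08" are read as decimal numbers.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed when the requested walltime ($1) fits within the limit ($2).
fits_limit() {
    [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}

limit="48:00:00"      # walltime limit of the regular queue shown above
request="50:00:00"    # hypothetical request
if fits_limit "$request" "$limit"; then
    echo "$request fits within $limit"
else
    echo "$request exceeds $limit"
fi
```

Here 48:00:00 is 172800 seconds and 50:00:00 is 180000 seconds, so the request is reported as exceeding the limit.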
How do I delete a job?
In order to delete a job first find out the job id number:
[node0000]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1290.node0000 job.test jackie 12:00:00 R regular
1291.node0000 job2.test jackie 01:00:00 R batch
1292.node0000 job3.test jackie 00:45:00 R batch
1293.node0000 job4.test jackie 00:30:00 R batch
To delete job 1293, pass its job id number to qdel:
[node0000]$ qdel 1293
Make sure the job was deleted.
[node0000]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1290.node0000 job.test jackie 12:00:00 R regular
1291.node0000 job2.test jackie 01:00:00 R batch
1292.node0000 job3.test jackie 00:45:00 R batch
Job 1293 is no longer listed in the output above, so it has been deleted.
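The same check can be done in a script rather than by eye. This is a sketch only: job_listed is a helper name made up here, and the listing is the second qstat output from above rather than live data.

```shell
#!/bin/sh
# Succeed if the given job number still appears in qstat output on stdin.
# (job_listed is a hypothetical helper.)
job_listed() {
    awk -v id="$1" 'index($1, id ".") == 1 { found = 1 } END { exit !found }'
}

# On a cluster use:  qstat | job_listed 1293
listing='1290.node0000 job.test jackie 12:00:00 R regular
1291.node0000 job2.test jackie 01:00:00 R batch
1292.node0000 job3.test jackie 00:45:00 R batch'

if printf '%s\n' "$listing" | job_listed 1293; then
    echo "job 1293 is still listed"
else
    echo "job 1293 is gone"
fi
```

On servers configured to keep completed jobs for a while, a deleted job may remain in the listing with state C before it disappears.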
How to check if a node is available:
pbsnodes -a
This will show you the state of all the nodes.
# pbsnodes -a
node0000
state = free
np = 4
properties = shared
ntype = cluster
status = opsys=linux,uname=Linux node0000 2.6.9-11.ELsmp #1 SMP Fri May 20 18:25:30 EDT 2005 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=27114,totmem=8158988kb,availmem=8076976kb,physmem=8158988kb,ncpus=4,loadave=0.00,netload=64785310,state=free,jobs=? 0,rectime=1174679615
node0001
state = free
np = 4
properties = shared
ntype = cluster
status = opsys=linux,uname=Linux node0001 2.6.9-11.ELsmp #1 SMP Fri May 20 18:25:30 EDT 2005 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=27104,totmem=8158988kb,availmem=8076820kb,physmem=8158988kb,ncpus=4,loadave=0.00,netload=64538012,state=free,jobs=? 0,rectime=1174679599
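On a large cluster the full listing is long, so it can help to print only the names of the free nodes. The sketch below assumes the layout shown above, where a node name stands alone on a line and attributes have the form "name = value". free_nodes is a helper name made up for this example, and the input is a shortened copy of the listing.

```shell
#!/bin/sh
# Print the names of nodes whose state is exactly "free", reading
# "pbsnodes -a" output on stdin. A node name is taken to be a line made of a
# single word with no "=" in it.
free_nodes() {
    awk '
        NF == 1 && $1 !~ /=/       { node = $1 }
        $1 == "state" && $2 == "=" { if ($3 == "free") print node }
    '
}

# On a cluster use:  pbsnodes -a | free_nodes
free_nodes <<'EOF'
node0000
state = free
np = 4
node0001
state = free
np = 4
EOF
```

For the shortened listing this prints node0000 and node0001, one per line.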