How should my script look?


A job script consists of PBS directives (lines starting with #PBS), comments, and executable statements. A directive provides a way of specifying job attributes in addition to the command line options. The attributes we use most often are walltime, nodes, and number of processors. There are many others; see the man page for pbs_resources_linux for further information.
   

Examples of how to submit a job script:

    Command line options:

        qsub -l nodes=15,walltime=2:00:00 script

    or

    Script file using directives:

        #PBS -l nodes=15,walltime=2:00:00
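
The two forms above can be combined in one file. The following is a minimal sketch, not a site-specific template: the file name job.sh, the temporary directory and the program ./hello-world are placeholders. It writes a job script that carries its own directives and then lists every #PBS line in it. PBS_O_WORKDIR is the variable Torque sets to the directory qsub was run from.

```shell
#!/bin/sh
# Sketch: write a job script with embedded directives, then list them.
# job.sh and ./hello-world are placeholder names.
workdir=$(mktemp -d)
cat > "$workdir/job.sh" <<'EOF'
#!/bin/sh
#PBS -N sample
#PBS -l nodes=15,walltime=2:00:00
cd "$PBS_O_WORKDIR"
./hello-world
EOF

# Print the option part of every "#PBS" line in a script.
list_directives() {
    sed -n 's/^#PBS[[:space:]]*//p' "$1"
}

list_directives "$workdir/job.sh"
```

If the same option is given both on the qsub command line and in a directive, the command line value is the one that takes effect.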

How to submit a Job


I. Using the command qsub users can submit a script file.

II. Example of a script file:

vi hello.sh

#!/bin/sh
# This is a simple example of a Torque script
#PBS -N sample
cd $HOME/
./hello-world

III. To submit this script type
  [node0000]$  qsub hello.sh  

qsub prints the job id on the screen in the form number.hostname (e.g. 0.hostname).
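
Since qsub prints the job id, a wrapper script can capture it for later use with qstat, tracejob or qdel. This is a small sketch: the helper name job_number is made up for illustration, and the id from the example above stands in for real qsub output so the snippet runs anywhere.

```shell
# Sketch: keep the job id that qsub prints so later commands can use it.
# On the cluster this would be: jobid=$(qsub hello.sh)
# Here the id from the example above stands in for real qsub output.
jobid="0.hostname"

# Strip the ".hostname" suffix to get the bare job number.
job_number() {
    echo "${1%%.*}"
}

echo "submitted job $(job_number "$jobid")"
```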

How to check the status of my job


I. Using the command qstat users can check the status of their job. Look for the job id number.

  [node0000]$  qstat 

Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
0.hostname       hello.sh         jackie           00:00:00 R batch

Note: See the man pages for further options to qstat
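
The S column holds the job state (for example Q for queued, R for running, C for completed). A script can pull that column out of the listing. The sketch below assumes the default six-column qstat layout shown above; job_state is a hypothetical helper name, and the sample text is copied from the listing rather than produced by a live server.

```shell
# Print the state column (S) for one job id, reading qstat output on stdin.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Sample taken from the qstat listing above; on the cluster use:
#   qstat | job_state 0.hostname
job_state 0.hostname <<'EOF'
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
0.hostname       hello.sh         jackie           00:00:00 R batch
EOF
```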

II. Using the command tracejob users can check the status of their job.


  [node0000]$  tracejob 0 


Job: 0.hostname

03/02/2007 14:37:26  S    Job Queued at request of jackie@hostname, owner = jackie@hostname, job name = hello.sh, queue
                          = batch
03/02/2007 14:37:26  A    queue=batch
03/02/2007 14:37:27  S    Job Modified at request of Scheduler@hostname
03/02/2007 14:37:27  L    Job Run
03/02/2007 14:37:27  S    Job Run at request of Scheduler@hostname
03/02/2007 14:37:27  A    user=jackie group=staff jobname=hello.sh queue=batch ctime=1172875046 qtime=1172875046 etime=1172875046
                          start=1172875047 exec_host=node0172/0+node0001/0+node0000/0 Resource_List.nodect=3 
03/02/2007 14:37:51  S    Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=708kb resources_used.vmem=11036kb
                          resources_used.walltime=00:00:22
03/02/2007 14:37:51  A    user=jackie group=staff jobname=hello.sh queue=batch ctime=1172875046 qtime=1172875046 etime=1172875046
                          start=1172875047 exec_host=node0000/0 session=0 end=1172875071 Exit_status=0
                          resources_used.cput=00:00:00 resources_used.mem=708kb resources_used.vmem=11036kb
                          resources_used.walltime=00:00:22
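
The Exit_status field in the trace tells you whether the job finished cleanly (0 here). As a rough sketch, it can be extracted with sed; exit_status is a made-up helper name, and the sample lines are abbreviated from the trace above.

```shell
# Print the first Exit_status value found in tracejob output on stdin.
exit_status() {
    sed -n 's/.*Exit_status=\(-\{0,1\}[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Abbreviated sample from the trace above; on the cluster use:
#   tracejob 0 | exit_status
exit_status <<'EOF'
03/02/2007 14:37:27  S    Job Run at request of Scheduler@hostname
03/02/2007 14:37:51  S    Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=708kb resources_used.vmem=11036kb
EOF
```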

How to script for a parallel run:

 vi mpi.sh

   #!/bin/bash
   #PBS -l nodes=16:ppn=2
   cd /home/jackie/QSUB

## Use only one of the examples below; each one launches the same program with a different MPI installation.

## Example 1: MPI under Topspin
   /usr/local/topspin/mpi/mpich/bin/mpirun_ssh -hostfile $PBS_NODEFILE -np 32 ./mpi-hello

## Example 2: Open MPI located in /usr/bin
   /usr/bin/mpirun --hostfile $PBS_NODEFILE -np 32 ./mpi-hello

## Example 3: LAM/MPI located in /usr/bin (lamboot starts the LAM daemons, lamhalt stops them)
   /usr/bin/lamboot
   /usr/bin/mpirun -hostfile $PBS_NODEFILE -np 32 ./mpi-hello
   /usr/bin/lamhalt

## Example 4: Open MPI loaded through modules
   . /usr/Modules/init/bash
   module load open-mpi/1.2.6-gcc
   mpirun --hostfile $PBS_NODEFILE -np 32 ./mpi-hello

Now save the file.
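
The value 32 passed to -np matches nodes=16:ppn=2, because Torque writes one line per processor slot into the file named by $PBS_NODEFILE. Instead of hard-coding the number, it can be counted from that file. The sketch below is one way to do it; count_slots is a hypothetical helper, and the stand-in node file with made-up host entries exists only so the snippet also runs outside a job.

```shell
# Count the process slots listed in a node file (one line per slot).
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Outside a job $PBS_NODEFILE is unset, so build a stand-in file
# with illustrative host names to keep the sketch runnable.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf '%s\n' node0000 node0000 node0001 node0001 > "$PBS_NODEFILE"
fi

NP=$(count_slots "$PBS_NODEFILE")

# Show the launch line; inside a real job script, drop the echo.
echo "mpirun --hostfile $PBS_NODEFILE -np $NP ./mpi-hello"
```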


Submit this script with qsub just like the previous job. You might have to specify which queue to submit your job to, so read further to see what queues are available.

How to check the queues that are available:


 [node0000]$  qstat -Q
Queue            Max Tot Ena Str Que Run Hld Wat Trn Ext Type
---------------- --- --- --- --- --- --- --- --- --- --- ----------
batch              0   0 yes yes   0   0   0   0   0   0 Execution
regular            0   0 yes yes   0   0   0   0   0   0 Execution


NOTE: If more than one Execution queue is listed, you must specify the queue you want to submit your job to. If there is only one, the server's default queue is set to batch and all jobs go to the batch queue.
 [node0000]$  qsub -q regular script
In this example the job script is sent to the regular queue. Substitute batch for regular and it will go to the batch queue. Each queue is configured differently, with its own limits.
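
To see at a glance which queue names can be passed to -q, the Type column of qstat -Q can be filtered. This is a sketch under the assumption that the listing has the layout shown above; execution_queues is a made-up helper name and the sample is copied from that listing.

```shell
# Print the names of all Execution queues, reading qstat -Q output on stdin.
execution_queues() {
    awk '$NF == "Execution" { print $1 }'
}

# Sample copied from the listing above; on the cluster use:
#   qstat -Q | execution_queues
execution_queues <<'EOF'
Queue            Max Tot Ena Str Que Run Hld Wat Trn Ext Type
---------------- --- --- --- --- --- --- --- --- --- --- ----------
batch              0   0 yes yes   0   0   0   0   0   0 Execution
regular            0   0 yes yes   0   0   0   0   0   0 Execution
EOF
```

The queue can also be fixed inside the job script with a directive such as #PBS -q regular, which has the same effect as the -q option on the command line.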

How to check what limits a queue has?

  [node0000]$  qstat -q

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --       4    3   0 --   E R
regular            --      --    48:00:00   --    1  20 --   E R

In this example the batch queue has a limit of 4 nodes and the regular queue has a 48-hour wallclock time limit. Jobs submitted to the regular queue must complete within 48 hours or they will be killed. Jobs going to the batch queue must request no more than 4 nodes or they will be rejected.
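
Before submitting, the requested walltime can be compared with the queue limit. The following sketch assumes both values are written as HH:MM:SS; to_seconds and fits_limit are illustrative helper names, not part of Torque.

```shell
# Convert an HH:MM:SS walltime to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed when the requested walltime does not exceed the queue limit.
fits_limit() {
    [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}

request="2:00:00"    # walltime asked for in the job script
limit="48:00:00"     # Walltime column of the regular queue

if fits_limit "$request" "$limit"; then
    echo "$request fits within the $limit limit"
else
    echo "$request exceeds the $limit limit"
fi
```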

How do I delete a job?

In order to delete a job, first find out the job id number:

  [node0000]$  qstat

Job id           Name                User          Time Use  S Queue
---------------- ---------------- ---------------- --------  - -----
1290.node0000      job.test         jackie         12:00:00  R  regular
1291.node0000      job2.test        jackie         01:00:00  R  batch
1292.node0000      job3.test        jackie         00:45:00  R  batch
1293.node0000      job4.test        jackie         00:30:00  R  batch

To delete job 1293:

  [node0000]$  qdel 1293

Make sure the job was deleted:

  [node0000]$  qstat

Job id           Name                User          Time Use  S Queue
---------------- ---------------- ---------------- --------  - -----
1290.node0000      job.test         jackie         12:00:00  R  regular
1291.node0000      job2.test        jackie         01:00:00  R  batch
1292.node0000      job3.test        jackie         00:45:00  R  batch
Job 1293 is no longer listed in the output above, so it has been deleted.
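
qdel accepts several job ids at once, so the ids can be collected from the qstat listing first. The sketch below selects one user's jobs in one queue and only prints the resulting qdel command as a dry run; jobs_in_queue is a hypothetical helper, and the sample rows are taken from the first listing above.

```shell
# Print the numeric ids of one user's jobs in one queue,
# reading qstat output on stdin.
jobs_in_queue() {
    awk -v user="$1" -v queue="$2" '
        $3 == user && $6 == queue { split($1, id, "."); print id[1] }
    '
}

# Rows from the qstat listing above; on the cluster pipe qstat instead.
sample='1290.node0000 job.test  jackie 12:00:00 R regular
1291.node0000 job2.test jackie 01:00:00 R batch
1292.node0000 job3.test jackie 00:45:00 R batch
1293.node0000 job4.test jackie 00:30:00 R batch'

ids=$(printf '%s\n' "$sample" | jobs_in_queue jackie batch)

# Dry run: print the command instead of executing it.
echo qdel $ids
```

Remove the echo only after checking that the printed ids are the jobs you really want to delete.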

How to check if a node is available:


The command pbsnodes -a shows the state of all the nodes.

# pbsnodes -a
node0000
     state = free
     np = 4
     properties = shared
     ntype = cluster
     status = opsys=linux,uname=Linux node0000 2.6.9-11.ELsmp #1 SMP Fri May 20 18:25:30 EDT 2005 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=27114,totmem=8158988kb,availmem=8076976kb,physmem=8158988kb,ncpus=4,loadave=0.00,netload=64785310,state=free,jobs=? 0,rectime=1174679615

node0001
     state = free
     np = 4
     properties = shared
     ntype = cluster
     status = opsys=linux,uname=Linux node0001 2.6.9-11.ELsmp #1 SMP Fri May 20 18:25:30 EDT 2005 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=27104,totmem=8158988kb,availmem=8076820kb,physmem=8158988kb,ncpus=4,loadave=0.00,netload=64538012,state=free,jobs=? 0,rectime=1174679599
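
On a large cluster the full listing is long, so it helps to reduce it to the names of the free nodes. This sketch assumes the usual pbsnodes layout, with the node name unindented and its attributes indented below it. The helper free_nodes is made up for illustration, and node0002 in the sample is a hypothetical entry added only to show the filtering.

```shell
# Print the names of nodes whose "state" attribute is exactly "free",
# reading pbsnodes -a output on stdin.
free_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }                                    # unindented line: node name
        $1 == "state" && $2 == "=" && $3 == "free" { print node }  # indented attribute line
    '
}

# Abbreviated sample; node0002 is a hypothetical entry.
# On the cluster use: pbsnodes -a | free_nodes
sample='node0000
     state = free
     np = 4

node0001
     state = free
     np = 4

node0002
     state = down
     np = 4'

printf '%s\n' "$sample" | free_nodes
```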