Definitions:

Torque has three components: the server, the scheduler, and the MOM.
The server and the scheduler are self-explanatory, but the MOM is not.
The MOM is the job executor, "the mother of all executing jobs".
A MOM places a job into execution when it receives a copy of the job from a server.


Torque Installation
-----------------------------------------------------------------------------

I. Download the RPM from http://bokeoa.com/rpms

I selected torque-1.1.0p2-aspen3.rpm, which was the latest version as of Oct 2004.

II. How to build from source on a Warewulf Cluster:

Master-node# rpmbuild --rebuild torque-1.1.0p2-aspen3.rpm

The rebuild (rpmbuild --rebuild expects a source RPM) wrote the binary packages
to /usr/src/redhat/RPMS/x86_64.

Master-node# cd /usr/src/redhat/RPMS/x86_64
Master-node# rpm -Uvh torque-....

Here is the list of what I installed:

torque-scheduler-1.1.0p2-aspen3   # Scheduler software
torque-mom-1.1.0p2-aspen3         # MOM software (execd for clients)
torque-docs-1.1.0p2-aspen3        # For man pages
torque-client-1.1.0p2-aspen3      # Client software only
torque-1.1.0p2-aspen3             # The actual software
torque-server-1.1.0p2-aspen3      # The server software
torque-debuginfo-1.1.0p2-aspen3   # Debugging purposes
torque-devel-1.1.0p2-aspen3       # Development software

Update the nodes with binutils and torque:

Master-node# wwvnfs.yum install binutils # needed for torque

Master-node# wwvnfs.rpm -Uvh torque-mom-1.1.0p2-aspen3.x86_64.rpm \
torque-1.1.0p2-aspen3.x86_64.rpm

III. Modifications needed:

Warewulf:

Master-node# vi /etc/warewulf/vnfs/excludes

Add these lines to the /var section of the excludes file:

+ var/spool/torque/
+ var/spool/torque/mom_logs
+ var/spool/torque/mom_priv
/var/*/*

IV. Configure torque on the master and the nodes:

Master setup
============================================

Master-node# cd /var/spool/torque/mom_priv
Master-node# vi config

Recommended settings are as follows:

$clienthost 192.168.2.10 #note: TORQUE server running pbs_server
$clienthost 192.168.1.200 #note: fileserver needed for /home NFS mount
$restricted 192.168.2.10 #note: TORQUE server running pbs_server
$logevent 255 #note: All events except debugging

Here is what the two host directives mean:

$clienthost = adds a host to the list of hosts that are allowed to connect to
the MOM, as long as they use a privileged port.

$restricted = adds a host to the list of hosts that are allowed to connect to
the MOM without needing to use a privileged port.
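
To double-check which hosts a MOM config grants access to, a small helper can pull the two host directives out of the file. This is only a sketch: the function name list_mom_hosts is my own, and it assumes the layout shown above (directive, then host, then an optional trailing comment).

```shell
#!/bin/sh
# list_mom_hosts FILE
# Print "directive host" for every $clienthost and $restricted line in a
# MOM config file. Anything after the host (e.g. "#note: ...") is dropped.
# Hypothetical helper, not part of Torque.
list_mom_hosts() {
    # Single quotes keep the shell from expanding $clienthost/$restricted;
    # inside awk they are plain string literals compared against field 1.
    awk '$1 == "$clienthost" || $1 == "$restricted" { print $1, $2 }' "$1"
}

# Example (path taken from this guide):
#   list_mom_hosts /var/spool/torque/mom_priv/config
```

Running it against both the master copy and the copy under /vnfs/default is a quick way to confirm the nodes will trust the same hosts as the master.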


Update TORQUE Server Configuration

* In the $(TORQUECFG)/server_priv/nodes file on the TORQUE server, add each node
that you intend PBS to use, one per line, with its processor count (np):

Master-node# cd /var/spool/torque/server_priv
Master-node# vi nodes

node0000 np=2
node0001 np=2
....
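
On a larger cluster, typing these entries by hand gets tedious. The loop below is a minimal sketch that prints them, assuming the nodes are numbered consecutively from node0000 and all have the same processor count; gen_nodes is a made-up helper name.

```shell
#!/bin/sh
# gen_nodes COUNT NP
# Print COUNT nodes-file entries (node0000, node0001, ...), each with np=NP.
# Assumes consecutive four-digit node names as used in this guide.
gen_nodes() {
    count=$1
    np=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'node%04d np=%s\n' "$i" "$np"
        i=$((i + 1))
    done
}

# Example: review the output first, then write it to the nodes file.
#   gen_nodes 16 2
#   gen_nodes 16 2 > /var/spool/torque/server_priv/nodes
```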


Cluster nodes setup
========================================

Master-node# cd /vnfs/default/var/spool/torque/mom_priv
Master-node# cp /var/spool/torque/mom_priv/config .

This will make sure that all the nodes have the same configuration file for MOM.

V. Build the virtual filesystem.

Master-node# wwvnfs.build # build the node image
Master-node# pdsh -a "/sbin/reboot" # reboot the nodes with the new image

VI. How to start pbs

A. Start pbs_server and pbs_sched on the master

The very first time, run: pbs_server -t create

This creates the server database and lets you define
the queueing system using qmgr as below.

Configure the Queuing System

I created a template file called qconfig, which is stored on the master in
/var/spool/torque. To load it, run qmgr on the master with the file as input:

qmgr < qconfig

In general:
qmgr < [filename]

Here is a copy of the qconfig file:

# Create queues and set their attributes.
#
#
create queue default
set queue default queue_type = Route
set queue default max_running = 10
set queue default route_destinations = long
set queue default enabled = True
##set queue default route_destinations += debug
##set queue default route_destinations += short
##set queue default route_destinations += extended
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long Priority = 60
set queue long max_running = 15
set queue long resources_max.cput = 48:00:00
set queue long resources_min.cput = 02:00:01
set queue long resources_default.cput = 48:00:00
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server max_user_run = 15
set server default_queue = default
set server log_events = 63
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.cput = 01:00:00
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server default_node = 11

There are only two queues: default (a routing queue) and long (an execution queue).
The server routes all jobs to the long queue at this time, since there are no other
queues set up and no special requests. The maximum cputime for this queue is 48
hours. The minimum cputime is just over 2 hours (02:00:01). Up to 15 jobs can run
at the same time as long as the resources are available. We can change this and
add other execution queues if necessary.
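
If another execution queue is needed later, the same qmgr directives used for long can be reused. The sketch below just prints them for a new queue and appends it to the routing list of default. The function name and the example values (queue name short, priority 80, 2-hour limit) are assumptions of mine, not settings from this cluster.

```shell
#!/bin/sh
# emit_exec_queue NAME PRIORITY MAX_RUNNING MAX_CPUT
# Print qmgr directives that create an execution queue and add it as a
# route destination of the "default" routing queue from qconfig.
# Hypothetical helper; only attributes already used in qconfig appear here.
emit_exec_queue() {
    name=$1 priority=$2 max_running=$3 max_cput=$4
    cat <<EOF
create queue $name
set queue $name queue_type = Execution
set queue $name Priority = $priority
set queue $name max_running = $max_running
set queue $name resources_max.cput = $max_cput
set queue $name enabled = True
set queue $name started = True
set queue default route_destinations += $name
EOF
}

# Example with made-up values: review the output, then feed it to qmgr.
#   emit_exec_queue short 80 15 02:00:00
#   emit_exec_queue short 80 15 02:00:00 | qmgr
```

Because the new queue is appended after long in route_destinations, I would expect jobs to be offered to long first and to fall through to the new queue only when long does not accept them, but check this on a test job before relying on it.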


B. Start pbs_mom on the nodes:

pdsh -a "/etc/init.d/pbs_mom start"

C. Restart the daemons on the master:

qterm -t quick (shut down the server)

pbs_server (start the server)

pbsnodes -a (verify that all nodes are reporting correctly)
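
With many nodes, the pbsnodes -a output is long to read by eye. The filter below is a rough sketch that prints only the nodes needing attention. It assumes the usual layout of that output (node name flush left, followed by indented "state = ..." lines) and treats anything other than free or job-exclusive as a problem; unhealthy_nodes is a name I made up.

```shell
#!/bin/sh
# unhealthy_nodes
# Read `pbsnodes -a` output on stdin and print "node state" for every node
# whose state is neither "free" nor "job-exclusive".
# Assumes node names start in column 1 and attributes are indented.
unhealthy_nodes() {
    awk '
        # A non-empty line that does not start with a blank is a node name.
        NF > 0 && substr($0, 1, 1) != " " && substr($0, 1, 1) != "\t" {
            node = $1
        }
        # Attribute line of the form: state = <value>
        $1 == "state" && $2 == "=" {
            if ($3 != "free" && $3 != "job-exclusive") print node, $3
        }
    '
}

# Usage on the master:
#   pbsnodes -a | unhealthy_nodes
```

No output means every node reported one of the two accepted states.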

D. You should be able to submit jobs to the cluster now.
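
A quick way to try this is a trivial job script. The sketch below writes one. The job name hello and the cput request of 3 hours are my own choices; I picked 3 hours because it falls inside the 02:00:01 to 48:00:00 window of the long queue defined in qconfig.

```shell
#!/bin/sh
# write_test_job FILE
# Write a minimal PBS job script to FILE. The job only prints the name of
# the node it ran on. Job name and cput value are illustrative assumptions.
write_test_job() {
    cat > "$1" <<'EOF'
#!/bin/sh
#PBS -N hello
#PBS -q default
#PBS -l nodes=1,cput=03:00:00
hostname
EOF
}

# Example, run as a regular user on the master:
#   write_test_job hello.pbs
#   qsub hello.pbs
#   qstat
```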