draft
from the notes of jscoggins Mar-16-2003
Install the sge rpm. Instructions and information can be found at http://warewulf-cluster.org/addons/sge.shtml
Once the package has been installed you can set up the queues manually
from the command line or with the GUI. The GUI seems to be the fastest,
but you can write scripts for the command line just as quickly.
How to add a queue to a cluster:
GUI - run the command 'qmon'
click on Queue Control
click on Add button
Inside this window you will see a bunch of options.
Queue - enter a queue name (e.g. node001.q)
Hostname - enter the node name for this queue to reside on. (node001)
General Configuration -
Sequence Nr. - Leave at 0. This is the queue's sequence number, which
the scheduler can use to order the queues when it picks one for a job
(it is not a job id). If you make it higher you might affect the
ordering of the jobs in SGE.
Processors - UNDEFINED - leave alone unless you want to limit the number
of processors a task can use on any system within that queue.
tmp Directory - /tmp - You can change this if your /tmp filesystem
does not have enough space to house temporary files.
Shell Start Mode - By default it is NONE. The choices are posix_compliant,
script_from_stdin, and unix_behavior.
Initial State - Leave at the default. The choices are enabled or disabled.
Default is enabled.
Calendar - No settings here. I am not controlling when a queue can
run jobs, although this can be done.
Notify Time - Just leave it alone as well. 60 seconds is the time that
passes before a KILL signal is sent after sending a user defined signal.
Job's Nice - 0 is default. The numbers range from -20 to 20 (-20 is the
highest priority and 20 the lowest). Never give a job a higher priority
than the system processes (swap, idle, etc.). The highest priority a job
should get is -1 or 0. The low end does not matter, but if it is around
20 that job will just be crawling like a turtle.
Slots - This determines how many jobs can run in this queue at once.
That is strictly up to the type of jobs you will be running.
Type section
Batch - check
Interactive - check
Parallel - check
Don't check - checkpoint and transfer.
Load/Suspend Thresholds
Load Thresholds - prevent the scheduling of additional jobs to the queue.
A threshold can be supplied for any load value. If one of the load
thresholds is exceeded the queue is set to the alarm state and no more
jobs are scheduled to this queue.
Suspend Thresholds - can be used to suspend jobs running on this queue if a load value is exceeded.
Both of these are up to you and depend on what the workload for the
group will be.
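
To get a feel for what a load threshold means, here is a small sketch of the comparison. It assumes that np_load_avg is the load average divided by the number of processors, and it uses 1.75, the value from the qconf template further down. The function name over_threshold is made up for this example. SGE does this check itself, so this is only an illustration.

```shell
#!/bin/sh
# over_threshold LOAD NCPU LIMIT   (hypothetical helper, not part of SGE)
# Succeeds (exit status 0) when LOAD divided by NCPU is greater than LIMIT.
over_threshold() {
    awk -v load="$1" -v ncpu="$2" -v limit="$3" \
        'BEGIN { exit !((load / ncpu) > limit) }'
}
```

For example, "over_threshold 4.0 2 1.75" succeeds because 4.0 / 2 = 2.0 is above 1.75, so a two-processor node with a load of 4.0 would be over the threshold.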
COMMAND LINE:
qconf -aq
This opens an editor with the following template. Make the changes where
you want them to be. The entries that are normally changed are marked
with **:
qname                template                      **
hostname             unknown                       **
seq_no               0
load_thresholds      np_load_avg=1.75              **
suspend_thresholds   NONE                          **
nsuspend             1
suspend_interval     00:05:00
priority             0
min_cpu_interval     00:05:00
processors           UNDEFINED
qtype                BATCH INTERACTIVE PARALLEL
rerun                FALSE
slots                1                             **
tmpdir               /tmp
shell                /bin/csh
shell_start_mode     NONE
prolog               NONE
epilog               NONE
starter_method       NONE
suspend_method       NONE
resume_method        NONE
terminate_method     NONE
notify               00:00:60
owner_list           NONE
user_lists           NONE
xuser_lists          NONE
subordinate_list     NONE
complex_list         NONE
complex_values       NONE
calendar             NONE
initial_state        default
s_rt                 INFINITY
h_rt                 INFINITY
s_cpu                INFINITY
h_cpu                INFINITY
s_fsize              INFINITY
h_fsize              INFINITY
s_data               INFINITY
h_data               INFINITY
s_stack              INFINITY
h_stack              INFINITY
s_core               INFINITY
h_core               INFINITY
s_rss                INFINITY
h_rss                INFINITY
s_vmem               INFINITY
h_vmem               INFINITY
Edit the template in this editor, save it, and then issue the command:
qconf -sq <new-queue name>
and make sure that your changes took effect.
You can make this a template by running the above command and saving
its output to a file. Change the queue name and node name in a copy
and you can clone the other systems via a script.
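
The cloning step could be scripted roughly like this. This is a sketch, not a tested procedure: the function names and the file naming (node002.q.conf and so on) are made up here, and it assumes the saved file has qname and hostname at the start of a line as in the template above. It only prints the "qconf -Aq" commands (add a queue from a file) instead of running them, so you can look at the generated files first.

```shell
#!/bin/sh
# make_queue_conf TEMPLATE NODE   (hypothetical helper)
# Print TEMPLATE with the qname and hostname lines rewritten for NODE.
make_queue_conf() {
    sed -e "s/^qname[[:space:]].*/qname                $2.q/" \
        -e "s/^hostname[[:space:]].*/hostname             $2/" \
        "$1"
}

# clone_queues TEMPLATE NODE...   (hypothetical helper)
# Write one <node>.q.conf file per node into the current directory and
# print the qconf command that would add it. Nothing is sent to SGE here.
clone_queues() {
    tmpl=$1
    shift
    for n in "$@"; do
        make_queue_conf "$tmpl" "$n" > "$n.q.conf"
        echo "qconf -Aq $n.q.conf"
    done
}
```

Usage would be something like "qconf -sq node001.q > node001.q.conf" followed by "clone_queues node001.q.conf node002 node003". Check the generated files, then run the printed commands.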
I will go into Complexes, Subordinates, and Execution methods later. For now you don't need to worry about these for a basic setup.
II. Parallel job considerations:
If you are going to run LAM or some other type of MPI jobs then you will need to install the SGE parallel lamstart and lamstop scripts.
Place these in /sge/lam/
>>> cat lamstart.sh
#!/bin/sh
# Build a LAM boot schema from the SGE PE hostfile ($1), then boot LAM.
nodefile=/tmp/lamnodes-$USER.$HOSTNAME
: > "$nodefile"
# Each hostfile line starts with: <hostname> <slots>
while read -r host nslots rest; do
    # keep only the short host name (drop everything after the first dot)
    echo "${host%%.*} cpu=${nslots}" >> "$nodefile"
done < "$1"
/usr/bin/lamboot "$nodefile" >/dev/null
rm -f "$nodefile"
>>> cat lamstop.sh
#!/bin/sh
lamhalt >/dev/null
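
To see what lamstart.sh will hand to lamboot without actually booting LAM, the conversion can be tried on its own. This is a sketch: the function name hostfile_to_lamnodes is made up, and it assumes each hostfile line starts with the host name followed by the slot count, which is what lamstart.sh relies on.

```shell
#!/bin/sh
# hostfile_to_lamnodes FILE   (hypothetical dry-run helper)
# Print one "<short-host> cpu=<slots>" line per hostfile entry.
hostfile_to_lamnodes() {
    while read -r host nslots rest; do
        # drop the domain part, keep the short host name
        echo "${host%%.*} cpu=${nslots}"
    done < "$1"
}
```

A hostfile line such as "node001.cluster.local 2 node001.q UNDEFINED" (a made-up sample) comes out as "node001 cpu=2".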
Setting up a Parallel Environment Configuration from the GUI:
Click the Add button:
Name = lam
Slots = 16, or the total for the entire cluster
Queues = all
Users = NONE (there is no user list, so anyone can use it)
Xusers = NONE ("")
Start Proc Args = /sge/lam/lamstart.sh $pe_hostfile
Stop Proc Args = /sge/lam/lamstop.sh
Allocation Rule = $fill_up
The allocation rule is the number of parallel processes to be allocated
on each machine which is used by a PE. $fill_up means to use all of the
available slots on a machine.
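
The same settings can probably be entered from the command line as well. The sketch below writes them to a file that could be loaded with "qconf -Ap" (add a parallel environment from a file). The field names are my assumption of how the GUI labels map onto the PE file format for this SGE version, and your version may require more fields, so compare against the output of "qconf -sp" for an existing PE before using it. The function name write_pe_conf is made up.

```shell
#!/bin/sh
# write_pe_conf NAME SLOTS   (hypothetical helper)
# Print a PE configuration with the values used in the GUI walkthrough.
# Field names are assumed; verify them with "qconf -sp <pe name>".
write_pe_conf() {
    printf 'pe_name           %s\n' "$1"
    printf 'queue_list        all\n'
    printf 'slots             %s\n' "$2"
    printf 'user_lists        NONE\n'
    printf 'xuser_lists       NONE\n'
    # single quotes keep $pe_hostfile and $fill_up literal for SGE to expand
    printf 'start_proc_args   /sge/lam/lamstart.sh $pe_hostfile\n'
    printf 'stop_proc_args    /sge/lam/lamstop.sh\n'
    printf 'allocation_rule   $fill_up\n'
}
```

For example, "write_pe_conf lam 16 > lam.pe" followed by "qconf -Ap lam.pe", after checking the file by hand.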
How to submit jobs?
qsub is the command used to submit jobs to the queue. Generally you
write a script that contains the submit information SGE needs to know
what to do with the job. Read the qsub man page, which should help out
some. There are also samples in /sge/examples/jobs. Check those out as well.
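
A minimal job script might look like the sketch below. The lines starting with "#$" are options for qsub embedded in the script. The job name hello, the slot count 4, and the function describe_job are made up for this example, and the PE name lam assumes the parallel environment set up above. Replace the echo with the real work (for a LAM job, typically an mpirun line).

```shell
#!/bin/sh
#$ -N hello
#$ -cwd
#$ -S /bin/sh
#$ -pe lam 4

# SGE sets JOB_NAME and, for parallel jobs, NSLOTS in the job environment.
# The defaults below only apply when the script is run outside SGE.
describe_job() {
    echo "job ${JOB_NAME:-hello} using ${NSLOTS:-1} slot(s)"
}

describe_job
```

Saved as hello.sh (a made-up file name), it would be submitted with "qsub hello.sh".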