Installation and Configuration of the SunGridEngine

draft

from the notes of jscoggins  Mar-16-2003


Install the sge rpm.  Instructions and information can be found at   http://warewulf-cluster.org/addons/sge.shtml

Once the package has been installed you can set up the queues from the command line or with the GUI.  The GUI seems to be the fastest way to start, but you can write scripts for the command line just as quickly.
 

How to add a queue to a cluster:

GUI - run the command 'qmon'
              click on Queue Control
              click on Add button
Inside this window you will see a bunch of options.

Queue - enter a queue name (e.g. node001.q)
Hostname - enter the name of the node this queue will reside on (e.g. node001)

General Configuration -
            Sequence Nr. - Leave at 0.  This is the queue's position in the scheduling order when queues are sorted by sequence number; it is not a job id.
                Making it higher can change the order in which SGE fills the queues.
            Processors - UNDEFINED - leave alone unless you want to limit the number of processors a task can use on any system within that queue.
            tmp Directory - /tmp  - You can change this if your /tmp filesystem does not have enough space to house temporary files.
            Shell Start Mode - NONE by default.  The choices are posix_compliant, script_from_stdin and unix_behavior.
            Initial State - leave at the default setting.  The other choices are enabled and disabled; by default the queue is enabled.
            Calendar - No settings here.  I am not controlling when a queue can run jobs, although a calendar can be used for that.
            Notify Time - Just leave it alone as well.  60 seconds is the time that passes between sending a user-defined signal and
                              sending the KILL signal.

            Job's  Nice - 0 is default.  The numbers range from -20 to 20 (-20 is the highest priority and 20 the lowest).  Never give a job a higher
                            priority than system processes (swap, idle, etc.).  The highest priority a job should get is -1 or 0.  The low end does not
                            matter, but around 20 the job will just be crawling like a turtle.

            Slots - this determines how many jobs can run in this queue at the same time.
                        The right number is strictly up to the type of jobs you will be running.

                    Type section
                                Batch - check
                                Interactive - check
                                Parallel -  check

                     Leave Checkpointing and Transfer unchecked.

               Load/Suspend Thresholds

               Load Thresholds - prevent the scheduling of additional jobs to the queue.  A threshold can be supplied for any load value.  If one of the load
                    thresholds is exceeded the queue is set to the alarm state and no more jobs are scheduled to this queue.

                Suspend Thresholds - can be used to suspend jobs running on this queue if a load value is exceeded.

                Both of these are your choice and depend on what the workload for the group will be.
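
To make the threshold idea concrete: np_load_avg, the load value used in the default queue template, is the load average divided by the number of processors.  The sketch below only illustrates that arithmetic.  The helper names are made up for this example and are not part of SGE, and it assumes "exceeded" means strictly greater than the threshold.

```shell
#!/bin/sh
# np_load_avg is a hypothetical helper that mirrors the SGE load value of
# the same name: the load average divided by the number of processors.
np_load_avg() {
    # $1 = load average, $2 = number of processors
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

# over_threshold (also hypothetical) exits 0 when the normalized load is
# greater than the threshold given as $3, and non-zero otherwise.
over_threshold() {
    awk -v load="$1" -v ncpu="$2" -v limit="$3" \
        'BEGIN { exit !((load / ncpu) > limit) }'
}

# Example: a dual-processor node with a load average of 4.0
# np_load_avg 4.0 2
# over_threshold 4.0 2 1.75 && echo "queue would go into the alarm state"
```

With a threshold of np_load_avg=1.75, a dual-processor node stops receiving new jobs once its load average passes 3.5.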
 

COMMAND LINE:

        qconf -aq

Will return the following template.  Make the changes where you want them
to be.  The settings that are normally changed are marked with **:
 

qname                template           **
hostname             unknown            **
seq_no               0
load_thresholds      np_load_avg=1.75   **
suspend_thresholds   NONE               **
nsuspend             1
suspend_interval     00:05:00
priority             0
min_cpu_interval     00:05:00
processors           UNDEFINED
qtype                BATCH INTERACTIVE PARALLEL
rerun                FALSE
slots                1                  **
tmpdir               /tmp
shell                /bin/csh
shell_start_mode     NONE
prolog               NONE
epilog               NONE
starter_method       NONE
suspend_method       NONE
resume_method        NONE
terminate_method     NONE
notify               00:00:60
owner_list           NONE
user_lists           NONE
xuser_lists          NONE
subordinate_list     NONE
complex_list         NONE
complex_values       NONE
calendar             NONE
initial_state        default
s_rt                 INFINITY
h_rt                 INFINITY
s_cpu                INFINITY
h_cpu                INFINITY
s_fsize              INFINITY
h_fsize              INFINITY
s_data               INFINITY
h_data               INFINITY
s_stack              INFINITY
h_stack              INFINITY
s_core               INFINITY
h_core               INFINITY
s_rss                INFINITY
h_rss                INFINITY
s_vmem               INFINITY
h_vmem               INFINITY
 

Edit the template in the editor that opens, save it, and then issue the command:

qconf -sq <new-queue name>

and make sure that your changes took effect.
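
The full qconf -sq listing is long, so it can help to filter it down to the attributes marked ** in the template.  A minimal sketch, assuming the attribute names shown in the template; queue_summary is a made-up helper name and node001.q is just the example queue from earlier:

```shell
#!/bin/sh
# queue_summary is a hypothetical helper: it reads qconf -sq output on
# stdin and keeps only the attributes that are normally changed.
queue_summary() {
    grep -E '^(qname|hostname|load_thresholds|suspend_thresholds|slots)[[:space:]]'
}

# Usage on a live cluster (not run here):
# qconf -sq node001.q | queue_summary
```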

You can make this a template by running the above command and saving the output to a file.  Change the queue name and node name in copies of that file
and you can clone the other systems via a script (qconf -Aq <file> adds a queue from a file).
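
A rough sketch of such a cloning script follows.  It assumes the template was saved with "qconf -sq node001.q > node001.q.conf" and that the nodes are named node002, node003 and so on.  make_queue_conf and the file names are made up for this example.

```shell
#!/bin/sh
# make_queue_conf is a hypothetical helper, not part of SGE.  It prints a
# copy of a saved queue definition with qname and hostname rewritten.
make_queue_conf() {
    template=$1   # file holding saved qconf -sq output
    node=$2       # node name, e.g. node002
    # Rewrite only the qname and hostname lines; keep everything else.
    sed -e "s/^qname .*/qname                ${node}.q/" \
        -e "s/^hostname .*/hostname             ${node}/" "$template"
}

# Example loop (commented out; it needs a running SGE master):
# for node in node002 node003 node004; do
#     make_queue_conf node001.q.conf "$node" > "/tmp/${node}.q.conf"
#     qconf -Aq "/tmp/${node}.q.conf"
# done
```

Running qconf -sq on one of the new queues afterwards is a quick check that the clone was accepted.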



NOTES:

I will go into Complexes, Subordinates and Execution methods later.  For now you don't need to worry about these for a basic setup.

II.  Parallel job considerations:

If you are going to run LAM or other MPI jobs then you will need to install the SGE parallel lamstart and lamstop scripts.

Place these in /sge/lam/
 

>>> cat lamstart.sh

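#!/bin/sh
# Build a LAM node file from the SGE PE host file ($1), then boot LAM.

nodefile=/tmp/lamnodes-$USER.$HOSTNAME
cat /dev/null > "$nodefile"

# Each host file line starts with: <hostname> <slots> ...
while read line; do
    # keep the short host name only (strip the domain part)
    host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
    nslots=`echo $line | cut -f2 -d" "`
    echo "${host} cpu=${nslots}" >> "$nodefile"
done < "$1"

/usr/bin/lamboot "$nodefile" >/dev/null

# the node file is only needed while lamboot runs
rm -f "$nodefile"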

>>> cat lamstop.sh

#!/bin/sh

lamhalt >/dev/null
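
To see what lamstart.sh hands to lamboot, the conversion step can be tried on its own.  This is only a sketch: hostfile_to_lamnodes is a made-up helper that mirrors the loop in lamstart.sh, and the sample host file lines in the usage comment are an assumption about the format (host name, slot count, then further columns that are ignored).

```shell
#!/bin/sh
# hostfile_to_lamnodes is a hypothetical helper mirroring lamstart.sh.
# It reads a PE host file ($1) and prints one LAM node line per host.
hostfile_to_lamnodes() {
    while read -r host nslots _rest; do
        # skip empty lines
        [ -n "$host" ] || continue
        # ${host%%.*} strips the domain, leaving the short host name
        echo "${host%%.*} cpu=${nslots}"
    done < "$1"
}

# Usage with an assumed sample host file:
# printf 'node001.cluster 2 node001.q UNDEFINED\n' > /tmp/sample_hostfile
# hostfile_to_lamnodes /tmp/sample_hostfile
```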


Setting up a Parallel Environment Configuration from the GUI:

Click the Add button:
        Name = lam
        Slots = 16, or the total for the entire cluster
        Queues = all
        Users = NONE (there is no user list, so anyone can use it)
        Xusers = NONE (likewise, no one is excluded)
        Start Proc Args = /sge/lam/lamstart.sh $pe_hostfile
        Stop Proc Args = /sge/lam/lamstop.sh
        Allocation Rule = $fill_up
                The allocation rule sets the number of parallel
                processes to be allocated on each machine that is
                used by a PE.  $fill_up means to use all of the
                available slots on a machine before moving on to
                the next one.
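
The same parallel environment can probably be set up without the GUI by loading a definition file.  The sketch below is an assumption-heavy example: the attribute names are my mapping of the GUI fields above and should be compared with the output of "qconf -sp" on your installation, which may also require attributes not listed here.  write_lam_pe and /tmp/lam.pe are made-up names.

```shell
#!/bin/sh
# write_lam_pe is a hypothetical helper that writes a PE definition file.
# Attribute names are assumed; check them against `qconf -sp` output.
write_lam_pe() {
    # The quoted 'EOF' keeps $pe_hostfile and $fill_up literal.
    cat > "$1" <<'EOF'
pe_name           lam
queue_list        all
slots             16
user_lists        NONE
xuser_lists       NONE
start_proc_args   /sge/lam/lamstart.sh $pe_hostfile
stop_proc_args    /sge/lam/lamstop.sh
allocation_rule   $fill_up
EOF
}

# On the SGE master (not run here):
# write_lam_pe /tmp/lam.pe
# qconf -Ap /tmp/lam.pe     # add the PE from the file
# qconf -sp lam             # show it to confirm
```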


How to submit jobs?

        qsub is the command used to submit jobs to the queue.

        Jobs are generally submitted as a script that contains the submit information SGE needs to know what to do with the job.  The qsub man page should help out some.  There are also samples in /sge/examples/jobs.  Check those out as well.
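
A minimal sketch of such a job script for the lam parallel environment set up above.  The job name lamtest, the slot count of 4 and ./my_mpi_program are placeholders, and write_lam_job is a made-up helper that just writes the script to a file so it can be inspected before it is submitted.

```shell
#!/bin/sh
# write_lam_job is a hypothetical helper that writes a sample job script.
write_lam_job() {
    # The quoted 'EOF' keeps $NSLOTS literal in the generated script.
    cat > "$1" <<'EOF'
#!/bin/sh
#$ -N lamtest
#$ -cwd
#$ -S /bin/sh
#$ -pe lam 4
# NSLOTS is set by SGE to the number of slots granted to the job.
# ./my_mpi_program is a placeholder for your own LAM/MPI binary.
mpirun -np $NSLOTS ./my_mpi_program
EOF
}

# On a submit host (not run here):
# write_lam_job lam_job.sh
# qsub lam_job.sh
```

Lines starting with #$ are read by qsub as if they were command line options: -N names the job, -cwd runs it from the current directory, -S picks the shell and -pe requests the parallel environment and number of slots.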



last  updated  march 17, 2003
sjames@lbl.gov