IT Division Scientific Cluster Support SLA
I. Background
This document describes the IT Division support model for Linux-based
computational clusters. It is intended to outline the expectations and
the limitations of this service.
Clusters require a high level of expertise to build and maintain, and
cluster systems have a large number of inherent failure points. We
leverage economies of scale and our experience supporting a variety of
cluster systems to offer a service that is both valuable and inexpensive
compared to other cluster support offerings. To do this successfully,
we have standardized on an implementation model that scales from the
administrator's perspective and remains customizable enough to meet
many needs.
II. Introduction
A. Parties Involved
1) IT Division UNIX Systems Group
2) End-User
B. Purpose
The purpose of this Service Level Agreement (SLA) is to specify
the services and commitments of the Service Provider as well as the
expectations and obligations of the Customer.
III. Responsibilities and Metrics of Service Provider
A. The Service Provider agrees it will provide:
Basic Red Hat installation on the master node
Warewulf cluster implementation toolkit
MPI2 compatibility provided by LAM or OpenMPI
Torque scheduler configuration
Computer Room Space*
Purchase and procurement consulting support
Limited use of small test cluster
Initial cluster build and setup
Cluster debugging and testing
Normal Operating System maintenance
Assistance with running user application code on cluster
Training on how to use cluster
CPPM Security compliance
System and network monitoring
Hardware monitoring
Faulty hardware replacement and troubleshooting
Cluster and related subsystem upgrades (as needed)
Crash recovery
* Note: Computer room space is provided to clusters in the SCS Program.
Computer room space is available on a recharge basis for customers not
in the SCS program.
B. Hours of Operation
Business Hours: Monday through Friday 8am-6pm PST
IV. Responsibilities of the Customer
A. The customer agrees it will:
1. Select a point of contact (POC) and describe to the end users the
process for obtaining help or reporting problems.
2. Coordinate with the Service Provider on any major configuration
changes (e.g. network installation, changes in topology,
relocations).
3. Maintain site conditions within the recommended environmental
range of all systems, devices, and media covered.
4. Provide feedback to improve the service.
5. Develop end-user contingency operations plans and capabilities.
6. Identify what resources will be matrixed or transferred to the
Service Provider, if applicable.
7. Provide the Service Provider with access to equipment both
electronically (passwords) and physically (cardkey access, room keys),
as needed to provide service.
8. Provide authorization of Service Provider activities
(system upgrades, reboots, etc.).
9. Retain final authority over the system(s) covered under this
agreement and maintain awareness of its responsibilities concerning
the operation of the system(s) under Laboratory RPM policy,
including computing security.
B. To submit a request for help, the customer will:
1. Contact the IT Division Help Desk at x4357 or send email to
scs@lbl.gov.
2. Include relevant contact information (name, organization,
location, system hostname).
3. Provide a description of the problem, its urgency, and
potential mission impact.
4. Be available to provide the Service Provider with additional
information as needed.
V. General Maintenance Responsibilities
A. The following areas of concern need to be resolved before a Service
Level Agreement can take effect.
1. Verification and setup of both the software and the hardware of
the customer system(s).
2. Electronic and physical access to the systems.
B. Customer will be responsible for all expenses incurred for all
hardware and peripheral maintenance.
C. Customer will be responsible for all expenses incurred for any
application oriented software maintenance and licenses installed on
the system(s).
D. A customer with root access voids all service guarantees if their
actions are the direct cause of a system failure or security breach.
VI. Attachments
A. Definitions and Terminology
B. Lists of supported hardware and software
1. Cluster Hardware Requirements:
* All nodes utilize Intel x86-64 or AMD64 type architecture
* Minimum of 10 nodes
* All nodes have a CPU clock above 1000MHz and 1GB of RAM.
* Conforms to the standard Beowulf spec (one master node, with
slave nodes residing on a private subnet behind the master)
* Slave nodes do not support console logins, nor can they be
used as general workstations/servers
* All slave nodes are reachable only from the master node
* All slave nodes must support PXE boot using Perceus
(AMD Opteron w/Nvidia chipset on mainboard currently not supported)
* Interconnect limited to GigE, Myrinet 2000, Cisco Infiniband
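The numeric hardware requirements above can be checked mechanically
before a cluster is proposed for support. The following Python sketch is
illustrative only: the Node class, the hardware_problems function, and
the architecture strings are hypothetical names chosen for this example.
It assumes that the 10-node minimum counts every node including the
master, and that the RAM requirement means at least 1024 MB; neither
reading is stated in this document. PXE boot support and the Beowulf
layout are not checked.

```python
from dataclasses import dataclass

# Values taken from the hardware requirements list; names are hypothetical.
SUPPORTED_ARCHS = {"x86-64", "amd64"}
SUPPORTED_INTERCONNECTS = {"GigE", "Myrinet 2000", "Cisco Infiniband"}
MIN_NODES = 10          # assumption: counts all nodes, master included
MIN_CLOCK_MHZ = 1000    # clock must be strictly above this value
MIN_RAM_MB = 1024       # assumption: "1GB of RAM" read as at least 1024 MB


@dataclass
class Node:
    arch: str
    clock_mhz: int
    ram_mb: int


def hardware_problems(nodes, interconnect):
    """Return a list of human-readable requirement violations (empty if none)."""
    problems = []
    if len(nodes) < MIN_NODES:
        problems.append(f"only {len(nodes)} nodes; minimum is {MIN_NODES}")
    if interconnect not in SUPPORTED_INTERCONNECTS:
        problems.append(f"unsupported interconnect: {interconnect}")
    for index, node in enumerate(nodes):
        if node.arch.lower() not in SUPPORTED_ARCHS:
            problems.append(f"node {index}: unsupported architecture {node.arch}")
        if node.clock_mhz <= MIN_CLOCK_MHZ:
            problems.append(f"node {index}: CPU clock {node.clock_mhz}MHz is too low")
        if node.ram_mb < MIN_RAM_MB:
            problems.append(f"node {index}: {node.ram_mb}MB of RAM is too little")
    return problems


# Example: a ten-node cluster of identical machines on GigE.
cluster = [Node("x86-64", 2400, 4096) for _ in range(10)]
print(hardware_problems(cluster, "GigE"))
```

An empty list means none of the checked requirements is violated; it does
not mean the cluster has been accepted for support.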
2. Cluster Software Requirements
* Red Hat or CentOS 4.x Linux operating system
* Perceus or Warewulf cluster implementation toolkit
* Torque scheduler with Maui
* Intel, Portland Group, or Pathscale compilers
* MPI2 compatibility provided by LAM-MPI or OpenMPI
3. Cluster Storage Hardware
* Low cost: Linux server with 3Ware RAID controller and SATA disks
* Recommended: Network Appliance file server
* High performance parallel: Panasas Activescale cluster storage
4. Clusters that will be located in the 50B-1275 Computer room must
meet the following additional requirements:
* Rack-mounted hardware is required. Desktop form factor hardware
is not allowed.
* Equipment to be installed into APC NetShelter computer racks.
* Equipment cooling is front (intake) to back (exhaust)
* Switched and metered 208V or 240V APC Rack PDUs.
Prospective cluster owners should include the cost of these racks
in their budget.
* Physical access is limited to SCS staff
C. Exclusions
The SCS program provides only support directly related to the
cluster. Additional support for other aspects of the user computing
environment is available on a Time and Materials basis.
No direct support is provided for application source debugging or
engineering.
Reinstallation of the cluster to an earlier OS release is not covered
by the SLA and will be done on a Time and Materials basis.
Backups can be provided by IT Division at additional cost.
D. Service and Fees
1. Support costs are waived for projects in the SCS program.
2. Costs depend heavily on the cluster design. If all standards
are followed, the basic cost will be $800/mo. for the master node
and $25/mo. for each additional compute node (e.g. master node +
20 compute nodes = $1300/month). There is an additional $200/mo.
charge for clusters with a high performance network fabric such as
Myrinet or Infiniband. For configurations outside the standard,
there will be either a time and materials charge for the difference
or an increased monthly premium. These costs can usually be
identified and explained during initial consultations.
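As an illustration of the standard fee schedule, the following Python
sketch computes the monthly charge for a cluster that follows all
standards. The function and parameter names are hypothetical, and
representing the SCS program waiver as a fee of zero is an assumption of
this example. Charges for non-standard configurations are not modeled,
since they are determined during consultation.

```python
# Rates from the standard fee schedule, in dollars per month.
MASTER_NODE_FEE = 800
COMPUTE_NODE_FEE = 25
FABRIC_SURCHARGE = 200   # Myrinet or Infiniband clusters


def monthly_support_fee(compute_nodes, high_performance_fabric=False,
                        in_scs_program=False):
    """Return the standard monthly support fee in dollars.

    compute_nodes counts the nodes in addition to the master node.
    """
    if compute_nodes < 0:
        raise ValueError("compute_nodes must be non-negative")
    if in_scs_program:
        # Assumption: the waiver is modeled as a zero fee.
        return 0
    fee = MASTER_NODE_FEE + COMPUTE_NODE_FEE * compute_nodes
    if high_performance_fabric:
        fee += FABRIC_SURCHARGE
    return fee


# The example from the fee schedule: master node + 20 compute nodes.
print(monthly_support_fee(20))                                # 1300
print(monthly_support_fee(20, high_performance_fabric=True))  # 1500
```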