IT Division Scientific Cluster Support SLA

I. Background

This document describes the IT Division support model for Linux based
computational clusters. It is intended to outline the expectations and
the limitations of this service.

Clusters require a high level of expertise to build and maintain. Cluster
systems also have a high number of inherent failure points. We are
leveraging the economy of scale and experience supporting various cluster
systems to offer a service that is both valuable and inexpensive compared
to other cluster support offerings. In order to successfully do this,
we have standardized on an implementation model that allows scalability
from the perspective of the administrator and customizability so it can
meet many needs.

II. Introduction
  A. Parties Involved
	1) IT Division UNIX Systems Group
 	2) End-User
  B. Purpose
The purpose of this Service Level Agreement (SLA) is to specify
the services and commitments of the Service Provider as well as the
expectations and obligations of the Customer.

III. Responsibilities and Metrics of Service Provider

  A. The Service Provider agrees it will provide:
	Basic RedHat installation on the master node 
	Warewulf cluster implementation toolkit 
	MPI2 compatibility provided by LAM or OpenMPI 
	Torque scheduler configuration 
	Computer Room Space*
	Purchase and procurement consulting support 
	Limited use of small test cluster 
	Initial cluster build and setup 
	Cluster debugging and testing 
	Normal Operating System maintenance 
	Assistance with running user application code on cluster 
	Training on how to use cluster 
	CPPM Security compliance 
	System and network monitoring 
	Hardware monitoring 
	Faulty hardware replacement and troubleshooting 
	Cluster and related subsystem upgrades (as needed) 
	Crash recovery

  * Note: Computer room space is provided to clusters in the SCS Program.
  Computer room space is available on a recharge basis for customers not
  in the SCS program.

  B. Hours of Operation
	Business Hours:  Monday through Friday 8am-6pm PST


IV. Responsibilities of the Customer
  A. The customer agrees it will:
	1. Select a POC and describe the process of obtaining help or
	reporting problems to the end users.
	2. Coordinate with the Service Provider on any major configuration
	changes (e.g. network installation, changes in topology,
	relocations).
	3. Maintain site conditions within the recommended environmental
	range of all systems, devices, and media covered.
	4. Provide feedback to improve the service.
 	5. Develop end-user contingency operations plans and capabilities.
	6. Identify what resources will be matrixed or transferred to the
	Service Provider, if applicable.
	7. Provide the Service Provider with access to equipment both
	electronically (passwords) and physically (cardkey access, room keys),
	as needed to provide service.
	8. Provide authorization of Service Provider activities
	(system upgrades, reboots, etc.).
	9. Customer maintains final authority over the system(s) covered
	under this agreement and will maintain awareness of their
	responsibilities concerning the operation of system(s) under
	Laboratory RPM policy. This includes computing security.

  B. To submit a request for help, the customer will:
	1. Contact the IT Division Help Desk x4357 to submit a request for
	help or send email to scs@lbl.gov
	2. Include relevant contact information (i.e. name, organization,
	location, system hostname).
	3. Provide a description of the problem, its urgency, and
	potential mission impact.
	4. Be available to provide the Service Provider with additional
	information as needed.


V. General Maintenance Responsibilities
  A. The following areas of concern need to be resolved before a Service
     Level Agreement can take effect.
	1. Verification and setup of customer system(s) of both software
	and hardware.
	2. Electronic and physical access to systems
  B. Customer will be responsible for all expenses incurred for all
     hardware and peripheral maintenance.
  C. Customer will be responsible for all expenses incurred for any
     application oriented software maintenance and licenses installed on
     the system(s).
  D. Customers with root access void all service guarantees if their
     actions are the direct cause of a system failure or security breach.


VI. Attachments
  A. Definitions and Terminology
  B. Lists of supported hardware and software
	1. Cluster Hardware Requirements: 
	* All nodes utilize Intel x86-64 or AMD64 type architecture 
	* Minimum of 10 nodes 
	* All nodes above 1000MHz CPU clock, and 1GB of RAM. 
	* Conforms to the standard Beowulf spec (one master node, with
	  slave nodes residing on a private subnet behind the master) 
	* Slave nodes do not support console logins, nor can they be
	  used as general workstations/servers 
	* All slave nodes only reachable from master node
	* All slave nodes must support PXE boot using Perceus
	  (AMD Opteron w/Nvidia chipset on mainboard currently not supported) 
	* Interconnect limited to GigE, Myrinet 2000, Cisco Infiniband 
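
	As an illustration only, the measurable items in the list above can
	be expressed as a pre-purchase check. This is a minimal sketch, not an
	SCS tool: the ClusterSpec fields and the hardware_issues function are
	hypothetical names, and it assumes that the 10-node minimum counts every
	node in the cluster and that "1GB of RAM" means at least 1 GB per node.

```python
from dataclasses import dataclass

# Limits copied from the hardware requirements list. The lowercase
# spellings are an assumption made only to simplify matching.
SUPPORTED_ARCHITECTURES = {"x86-64", "amd64"}
SUPPORTED_INTERCONNECTS = {"gige", "myrinet 2000", "cisco infiniband"}
MIN_NODES = 10
MIN_CPU_MHZ = 1000  # every node must be above this clock speed
MIN_RAM_GB = 1      # assumed to mean "at least 1 GB per node"


@dataclass
class ClusterSpec:
    """Hypothetical description of a proposed cluster."""
    architecture: str
    node_count: int
    slowest_cpu_mhz: int
    smallest_ram_gb: float
    interconnect: str
    pxe_boot: bool


def hardware_issues(spec):
    """Return a list of requirement violations (empty if none found)."""
    issues = []
    if spec.architecture.lower() not in SUPPORTED_ARCHITECTURES:
        issues.append("architecture is not x86-64 or AMD64")
    if spec.node_count < MIN_NODES:
        issues.append("fewer than 10 nodes")
    if spec.slowest_cpu_mhz <= MIN_CPU_MHZ:
        issues.append("a node is not above 1000MHz")
    if spec.smallest_ram_gb < MIN_RAM_GB:
        issues.append("a node has less than 1GB of RAM")
    if spec.interconnect.lower() not in SUPPORTED_INTERCONNECTS:
        issues.append("interconnect is not GigE, Myrinet 2000, or Cisco Infiniband")
    if not spec.pxe_boot:
        issues.append("slave nodes do not support PXE boot")
    return issues
```

	The check covers only the numeric and enumerated requirements. Items
	such as the private subnet layout and the console login restriction
	still need to be reviewed during consultation.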

	2. Cluster Software Requirements
	* Red Hat or CentOS 4.x Linux operating system 
	* Perceus or Warewulf cluster implementation toolkit 
	* Torque scheduler with Maui 
	* Intel, Portland Group, or Pathscale compilers
	* MPI2 compatibility provided by LAM-MPI or OpenMPI

	3. Cluster Storage Hardware
	* Low cost: Linux server with 3Ware RAID controller and SATA disks
	* Recommended:  Network Appliance file server
	* High performance parallel:  Panasas ActiveScale cluster storage

	4. Clusters that will be located in the 50B-1275 computer room must
	   meet the following additional requirements:
	* Rack mounted hardware required. Desktop form factor hardware
	  not allowed 
	* Equipment to be installed into APC NetShelter computer racks.
	* Equipment cooling is front (intake) to back (exhaust)
	* Switched and metered 208V or 240V APC Rack PDUs 
	  Prospective cluster owners should include the cost of these racks
	  in their budget
	* Physical access is limited to SCS staff 


  C. Exclusions

	The SCS program only provides for support directly related to the
	cluster. Additional support for other aspects of the user computing
	environment is available on a Time and Materials basis.
	No direct support is provided for application source
	debugging/engineering.
	Reinstallation of the cluster to an earlier OS release is not covered
	by the SLA and will be done on a Time and Materials basis.
	Backups can be provided by IT Division at additional cost.


  D. Service and Fees

	1. Support costs are waived for projects in the SCS program.

	2. Cost factors are very dependent on the cluster design. If
	all standards are followed, the basic cost will be $800/mo.
	for the master node and $25/mo. for each additional compute node
	(e.g. Master node + 20 compute nodes = $1300/month). There is an
	additional $200/mo. charge for clusters with a high performance
	network fabric such as Myrinet or Infiniband. For configurations
	outside the standard, there will be either a time and materials
	charge for the difference or an increased monthly premium. These
	costs can usually
	be identified and explained during initial consultations.
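
	The standard fee arithmetic above can be sketched as a short
	calculation. This is an illustrative estimate only: the function and
	parameter names are hypothetical, and it covers standard
	configurations, not the time and materials or premium charges for
	non-standard ones.

```python
# Rates taken from the Service and Fees section, in dollars per month.
MASTER_NODE_FEE = 800
COMPUTE_NODE_FEE = 25
FABRIC_SURCHARGE = 200  # Myrinet or Infiniband clusters


def monthly_support_cost(compute_nodes,
                         high_performance_fabric=False,
                         in_scs_program=False):
    """Estimate the monthly support cost for a standard cluster."""
    if compute_nodes < 0:
        raise ValueError("compute_nodes must not be negative")
    if in_scs_program:
        # Support costs are waived for projects in the SCS program.
        return 0
    cost = MASTER_NODE_FEE + COMPUTE_NODE_FEE * compute_nodes
    if high_performance_fabric:
        cost += FABRIC_SURCHARGE
    return cost


# The example from the text: master node + 20 compute nodes.
print(monthly_support_cost(20))  # 1300
```

	Under the same rates, the 20-node example with a Myrinet or Infiniband
	fabric would come to $1500/month.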