IT Division Scientific Cluster Support - Service Level Agreement

I. Background

This document describes the IT Division support model for Linux-based computational clusters and is intended to outline the expectations and limitations of this service. Clusters require a high level of expertise to build and maintain, and cluster systems have a high number of inherent failure points. We leverage economies of scale and our experience supporting a variety of cluster systems to offer a service that is both valuable and inexpensive compared to other cluster support offerings. To do this successfully, we have standardized on an implementation model that is scalable from the administrator's perspective and customizable enough to meet many needs.

II. Introduction

A. Parties Involved
1) IT Division HPC Services Group
2) End-User

B. Purpose
The purpose of this Service Level Agreement (SLA) is to specify the services and commitments of the Service Provider as well as the expectations and obligations of the Customer.

III. Responsibilities and Metrics of Service Provider

A. The Service Provider agrees it will provide:
* Basic Scientific Linux installation on the master node
* Warewulf cluster implementation toolkit
* MPI2 compatibility provided by OpenMPI
* SLURM scheduler
* Computer room space*
* Purchase and procurement consulting support
* Initial cluster build and setup
* Cluster debugging and testing
* Normal operating system maintenance
* Assistance with running user application code on the cluster
* Basic training on how to use the cluster
* CPPM security compliance
* System and network monitoring
* Hardware monitoring using NHC
* Faulty hardware replacement and troubleshooting
* Cluster and related subsystem upgrades (as needed)
* Crash recovery

* Note: Computer room space is provided to clusters in the HPCS Program on a monthly recharge basis. Because of limited data center resources, hardware must be removed 5 years after the date of purchase.

B. Hours of Operation
Business Hours: Monday through Friday, 8am-6pm PST

IV. Responsibilities of the Customer

A. The Customer agrees it will:
1. Select a POC and describe the process of obtaining help or reporting problems to the end users.
2. Coordinate with the Service Provider on any major configuration changes (i.e. network installation, changes in topology, relocations, etc.).
3. Maintain site conditions within the recommended environmental range of all systems, devices, and media covered.
4. Provide feedback to improve the service.
5. Develop end-user contingency operations plans and capabilities.
6. Identify what resources will be matrixed or transferred to the Service Provider, if applicable.
7. Provide the Service Provider with access to equipment both electronically (passwords) and physically (cardkey access, room keys), as needed to provide service.
8. Provide authorization of Service Provider activities (system upgrades, reboots, etc.).
9. Retain final authority over the system(s) covered under this agreement and maintain awareness of its responsibilities concerning the operation of those system(s) under Laboratory RPM policy. This includes computing security and backups.

B. To submit a request for help, the Customer will:
1. Contact the IT Division Help Desk at x4357, or send email to firstname.lastname@example.org.
2. Include relevant contact information (i.e. name, organization, location, system hostname).
3. Provide a description of the problem, its urgency, and potential mission impact.
4. Be available to provide the Service Provider with additional information as needed.

V. General Maintenance Responsibilities

A. The following areas of concern must be resolved before a Service Level Agreement can take effect:
1. Verification and setup of customer system(s), both software and hardware.
2. Electronic and physical access to systems.

B. The Customer will be responsible for all expenses incurred for all hardware and peripheral maintenance.

C. The Customer will be responsible for all expenses incurred for any application-oriented software maintenance and licenses installed on the system(s).

D. A Customer with root access will void all service guarantees if their actions are the direct cause of a system failure or security breach.

VI. Attachments

A. Definitions and Terminology

B. Lists of supported hardware and software
1. Cluster Hardware Requirements:
* All nodes utilize Intel x86-type architecture
* Minimum of 10 nodes
* Conforms to the standard Beowulf specification (one master node, with slave nodes residing on a private subnet behind the master)
* Slave nodes do not support console logins, nor can they be used as general workstations/servers
* All slave nodes are reachable only from the master node
* All slave nodes must support PXE boot using Warewulf
2. Cluster Software Requirements:
* Scientific Linux 6 operating system
* Warewulf cluster implementation toolkit
* SLURM job scheduler
* Intel compilers
* MPI2 compatibility provided by OpenMPI
3. Cluster Storage Hardware:
* Low cost: Linux server with LSI RAID controller and SATA disks
* Recommended: BlueArc or Network Appliance file server
* High-performance parallel: IBM GPFS storage, or Lustre parallel filesystem on Data Direct Networks storage hardware
4. Clusters that will be located in the 50B-1275 computer room must meet the following additional requirements:
* Rack-mounted hardware is required.
* Equipment is to be installed into APC NetShelter 42U computer racks. Prospective cluster owners should include the cost of these racks in their budget.
* Equipment cooling is front (intake) to back (exhaust).
* Switched and metered 208V APC rack PDUs.
* Physical and root access is limited to HPCS staff.

C. Exclusions
The HPCS program only provides support directly related to the cluster.
Additional support for other aspects of the user computing environment is available on a Time and Materials basis.
* No direct support is provided for application source debugging/engineering.
* Reinstallation of the cluster to an earlier OS release is not covered by the SLA and will be done on a Time and Materials basis.
* Backups are the responsibility of the cluster owner. Backups can be provided by IT Division at additional cost.

D. Service and Fees
1. Clusters are only managed under a monthly Service Level Agreement.
2. Cost factors can depend on the cluster design. If all standards are followed, the basic cost is $300/mo. for the master node and $15/mo. for each additional compute node (e.g. master node + 20 compute nodes = $600/month). There is an additional charge of $300/mo. for clusters with a high-performance network fabric such as InfiniBand. Storage servers are also charged at $300/mo.

Important Note: For configurations outside the standard, there will be either a Time and Materials charge for the difference or an increased monthly premium. These costs can usually be identified and explained during initial consultations. Please note that these are direct costs; LBNL burdens depend on the type of project.
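The standard fee schedule above can be expressed as a simple calculation. The sketch below is illustrative only (the function name and parameters are not part of the SLA); it encodes the quoted rates for a standard configuration and does not model non-standard configurations, Time and Materials charges, or LBNL burdens:

```python
def monthly_cost(compute_nodes, storage_servers=0, high_perf_fabric=False):
    """Estimate the monthly SLA charge for a standard cluster.

    Rates from the fee schedule: $300/mo. for the master node,
    $15/mo. per compute node, $300/mo. per storage server, and an
    additional $300/mo. for a high-performance fabric such as InfiniBand.
    """
    cost = 300                      # master node
    cost += 15 * compute_nodes      # compute nodes
    cost += 300 * storage_servers   # storage servers
    if high_perf_fabric:
        cost += 300                 # high-performance network fabric surcharge
    return cost

# Example from the fee schedule: master node + 20 compute nodes
print(monthly_cost(20))  # 600
```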