
SLURM V2.2 User's Guide
extreme computing
Software
July 2010

BULL CEDOC
357 AVENUE PATTON
B.P. 20845
49008 ANGERS CEDEX 01
FRANCE

REFERENCE 86 A2 45FD 01

The following copyright notice protects this book under Copyright laws which prohibit such actions as, but not limited to, copying, distributing, modifying, and making derivative works.

Copyright © Bull SAS 2010
Printed in France

Trademarks and Acknowledgements
We acknowledge the rights of the proprietors of the trademarks mentioned in this manual. All brand names and software and hardware product names are subject to trademark and/or patent protection. Quoting of brand and product names is for information purposes only and does not represent trademark misuse.

The information in this document is subject to change without notice. Bull will not be liable for errors contained herein, or for incidental or consequential damages in connection with the use of this material.

Table of Contents

Preface
Chapter 1. SLURM Overview
  1.1 SLURM Key Functions
  1.2 SLURM Components
  1.3 SLURM Daemons
    1.3.1 SLURMCTLD
    1.3.2 SLURMD
    1.3.3 SlurmDBD (SLURM Database Daemon)
  1.4 Scheduler Types
  1.5 The slurm.conf configuration file
  1.6 SCONTROL – Managing the SLURM Configuration
Chapter 2. Installing and Configuring SLURM
  2.1 Installing SLURM
  2.2 Configuring SLURM on the Management Node
    2.2.1 Create and Modify the SLURM configuration file
    2.2.2 Setting up a slurmdbd.conf file
    2.2.3 MySQL Configuration
    2.2.4 Setting up a topology.conf file
    2.2.5 Final Configuration Steps
    2.2.6 Completing the Configuration of SLURM on the Management Node Manually
  2.3 Configuring SLURM on the Reference Node
    2.3.1 Using the slurm_setup.sh Script
    2.3.2 Manually configuring SLURM on the Reference Nodes
    2.3.3 Starting the SLURM Daemons on a Single Node
  2.4 Check and Start the SLURM Daemons on Compute Nodes
  2.5 Configuring Pam_Slurm Module
  2.6 Installing and Configuring Munge for SLURM Authentication (MNGT)
    2.6.1 Introduction
    2.6.2 Creating a Secret Key
    2.6.3 Starting the Daemon
    2.6.4 Testing the Installation
Chapter 3. Administrating Cluster Activity with SLURM
  3.1 The SLURM Daemons
  3.2 Starting the Daemons
  3.3 SLURMCTLD (Controller Daemon)
  3.4 SLURMD (Compute Node Daemon)
  3.5 SLURMDBD (SLURM Database Daemon)
  3.6 Node Selection
  3.7 Logging
  3.8 Corefile Format
  3.9 Security
  3.10 SLURM Cluster Administration Examples
Chapter 4. SLURM High Availability
  4.1 SLURM High Availability
    4.1.1 SLURM High Availability using Text File Accounting
    4.1.2 SLURM High Availability using Database Accounting
  4.2 Starting SLURM High Availability
  4.3 Restoring SLURM High Availability following a Primary Management Node crash
Chapter 5. Managing Resources using SLURM
  5.1 SLURM Resource Management Utilities
  5.2 MPI Support
  5.3 SRUN
  5.4 SBATCH (batch)
  5.5 SALLOC (allocation)
  5.6 SATTACH
  5.7 SACCTMGR
  5.8 SBCAST
  5.9 SQUEUE (List Jobs)
  5.10 SINFO (Report Partition and Node Information)
  5.11 SCANCEL (Signal/Cancel Jobs)
  5.12 SACCT (Accounting Data)
  5.13 STRIGGER
  5.14 SVIEW
  5.15 Global Accounting API
Chapter 6. Tuning Performances for SLURM Clusters
  6.1 Configuring and Sharing Consumable Resources in SLURM
  6.2 SLURM and Large Clusters
    6.2.1 Node Selection Plug-in (SelectType)
    6.2.2 Job Accounting Gather Plug-in (JobAcctGatherType)
    6.2.3 Node Configuration
    6.2.4 Timers
    6.2.5 TreeWidth parameter
    6.2.6 Hard Limits
  6.3 SLURM Power Saving Mechanism
    6.3.1 Configuring Power Saving
    6.3.2 Fault tolerance
Chapter 7. Troubleshooting SLURM
  7.1 SLURM does not start
  7.2 SLURM is not responding
  7.3 Jobs are not getting scheduled
  7.4 Nodes are getting set to a DOWN state
  7.5 Networking and Configuration Problems
  7.6 More Information
Glossary
Index

List of Figures
Figure 1-1. SLURM Simplified Architecture
Figure 1-2. SLURM Architecture - Subsystems
Figure 4-1. SLURM High Availability using Text File Accounting
Figure 5-1. MPI Process Management With and Without Resource Manager

List of Tables
Table 1-1. Role Descriptions for SLURMCTLD Software Subsystems
Table 1-2.
SLURMD Subsystems and Key Tasks
Table 1-3. SLURM Scheduler Types

Preface

Note: The Bull Support Web site may be consulted for product information, documentation, downloads, updates and service offers: http://support.bull.com

Scope and Objectives
A resource manager is used to allocate resources, to find out the status of resources, and to collect task execution information. Bull Extreme Computing platforms use SLURM, an open-source, scalable resource manager. This guide describes how to configure, manage, and use SLURM.

Intended Readers
This guide is for Administrators and Users of Bull Extreme Computing systems.

Prerequisites
This manual applies to SLURM versions from version 2.2 onwards, unless otherwise indicated.

Important: The Software Release Bulletin contains the latest information for your delivery and should be read first. Contact your support representative for more information.

Chapter 1. SLURM Overview

Merely grouping together several machines on a network is not enough to constitute a real cluster. Resource Management software is required to optimize the throughput within the cluster, according to specific scheduling policies. A resource manager is used to allocate resources, to find out the status of resources, and to collect task execution information. From this information, the scheduling policy can be applied. Bull Extreme Computing platforms use SLURM, an open-source, scalable resource manager.

1.1 SLURM Key Functions

As a cluster resource manager, SLURM has three key functions. Firstly, it allocates exclusive and/or non-exclusive access to resources (Compute Nodes) to users for some duration of time so they can perform work.
Secondly, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work.

Optional plug-ins can be used for accounting, advanced reservation, backfill scheduling, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.

Users interact with SLURM using various command line utilities:
• SRUN to submit a job for execution
• SBCAST to transmit a file to all nodes running a job
• SCANCEL to terminate a pending or running job
• SQUEUE to monitor job queues
• SINFO to monitor partition and the overall system state
• SACCTMGR to view and modify SLURM account information; used with the slurmdbd daemon
• SACCT to display data for all jobs and job steps in the SLURM accounting log
• SBATCH to submit a batch script to SLURM
• SALLOC to allocate resources for a SLURM job
• SATTACH to attach to a running SLURM job step
• STRIGGER to set, get or clear SLURM event triggers
• SVIEW to display SLURM state information graphically; requires an X Windows capable display
• SREPORT to generate reports from the SLURM accounting data when using an accounting database
• SSTAT to display various status information for a running job or step

See the man pages for the commands above for more information.

System administrators perform privileged operations through an additional command line utility, SCONTROL. The central controller daemon, SLURMCTLD, maintains the global state and directs operations. Compute Nodes simply run a SLURMD daemon (similar to a remote shell daemon) to export control to SLURM.

1.2 SLURM Components

SLURM consists of two types of daemons and various command-line user utilities. The relationships between these components are illustrated in the following diagram:

Figure 1-1.
SLURM Simplified Architecture

1.3 SLURM Daemons

1.3.1 SLURMCTLD

The central control daemon for SLURM is called SLURMCTLD. SLURMCTLD is multi-threaded; thus, some threads can handle problems without delaying services to normal jobs that are also running and need attention. SLURMCTLD runs on a single management node (with a fail-over spare copy elsewhere for safety), reads the SLURM configuration file, and maintains state information on:
• Nodes (the basic compute resource)
• Partitions (sets of nodes)
• Jobs (or resource allocations to run jobs for a time period)
• Job steps (parallel tasks within a job).

The SLURMCTLD daemon in turn consists of three software subsystems, each with a specific role (Table 1-1):

Node Manager: Monitors the state and configuration of each node in the cluster. It receives state-change messages from each Compute Node's SLURMD daemon asynchronously, and it also actively polls these daemons periodically for status reports.

Partition Manager: Groups nodes into disjoint sets (partitions) and assigns job limits and access controls to each partition. The partition manager also allocates nodes to jobs (at the request of the Job Manager) based on job and partition properties. SCONTROL is the (privileged) user utility that can alter partition properties.

Job Manager: Accepts job requests (from SRUN or a metabatch
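As a quick illustration of how the user utilities listed above fit together, the sketch below writes a minimal batch script of the kind submitted with SBATCH. The job name, the partition name "compute", and the resource values are illustrative assumptions, not values taken from this guide or from any particular cluster.

```shell
# Write a minimal SLURM batch script. The #SBATCH lines are directives
# read by SBATCH at submission time; "compute" is an example partition.
cat > hello.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --time=00:05:00
# Launch one task per allocated node:
srun hostname
EOF
```

On a configured cluster the script would be submitted with `sbatch hello.slurm`, the queue monitored with `squeue`, and the job cancelled, if required, with `scancel <jobid>`.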
