Slurm HPC Workload Manager
A Slurm HPC Workload Manager is an open-source workload manager that schedules batch jobs and allocates compute resources in high-performance computing environments.
- AKA: SLURM, Simple Linux Utility for Resource Management, Slurm Workload Manager, Slurm Scheduler.
- Context:
- It can typically manage Slurm Job Queues through slurm scheduling algorithms and slurm priority mechanisms.
- It can typically allocate Slurm Compute Nodes via slurm node allocation policies and slurm partition configurations.
- It can typically support Slurm Job Arrays for slurm parallel job execution and slurm task distribution.
- It can typically implement Slurm Fair-Share Scheduling through slurm fairshare algorithms and slurm priority factors.
- It can typically handle Slurm Node Failures via slurm fault tolerance mechanisms and slurm job checkpoints.
- It can typically enforce Slurm Resource Limits through slurm accounting databases and slurm quality of service.
- It can often enable Slurm Burst Buffers for slurm i/o optimization and slurm staging area management.
- It can often support Slurm Federations for slurm multi-cluster job submission and slurm resource sharing.
- It can often integrate with Slurm Job Submission Tools including slurm sbatch, slurm srun, and slurm salloc.
- It can range from being a Small Slurm Cluster to being a Supercomputing Slurm Cluster, depending on its slurm cluster scale.
- It can range from being a Homogeneous Slurm System to being a Heterogeneous Slurm System, depending on its slurm node architecture diversity.
- It can range from being a Single-Site Slurm System to being a Multi-Site Slurm System, depending on its slurm geographic distribution.
- It can range from being a CPU-Only Slurm System to being a GPU-Accelerated Slurm System, depending on its slurm hardware accelerator support.
- ...
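The slurm job array and slurm job submission capabilities above can be illustrated with a minimal sbatch batch script; the partition name, time limit, and job name here are illustrative placeholders, not site defaults:

```shell
#!/bin/bash
# Minimal Slurm batch script demonstrating a job array.
# The partition name below is a hypothetical example; real names are site-specific.
#SBATCH --job-name=array-demo
#SBATCH --partition=compute            # hypothetical partition name
#SBATCH --array=0-9                    # ten array tasks, indices 0..9
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00
#SBATCH --output=array-demo_%A_%a.out  # %A = array job ID, %a = array task index

# Each array task receives its own index in SLURM_ARRAY_TASK_ID.
echo "Task ${SLURM_ARRAY_TASK_ID} running on $(hostname)"
```

Submitted with `sbatch array-demo.sh`, the script spawns ten independent tasks that `squeue -u $USER` lists as pending or running; for interactive work, `salloc` reserves equivalent resources and `srun` launches tasks inside the allocation.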
- Examples:
- Slurm Controller Daemons, such as: the slurmctld daemon, which manages the slurm job queue and slurm node states, and the slurmdbd daemon, which serves the slurm accounting database.
- Slurm Compute Daemons, such as: the slurmd daemon, which runs on each slurm compute node to launch and monitor slurm tasks.
- TOP500 Slurm Deployments, such as:
- Frontier Supercomputer Slurm at Oak Ridge National Laboratory (#1 TOP500 June 2022).
- LUMI Supercomputer Slurm at the CSC data center in Finland.
- Perlmutter Supercomputer Slurm at NERSC with 7,000+ GPUs.
- Academic HPC Slurm Installations, such as:
- XSEDE Slurm Clusters across 20+ US universities.
- European PRACE Slurm Systems for pan-European hpc access.
- ...
- Counter-Examples:
- PBS Professional, which is Altair's commercial pbs-based job scheduler.
- Torque Resource Manager, an open-source resource manager derived from OpenPBS.
- Grid Engine System, which descends from Sun Grid Engine and implements grid-based resource management.
- Kubernetes System, which focuses on container orchestration rather than hpc workload management.
- See: High-Performance Computing Workload Management System, Batch Job Scheduling System, Cluster Management System, Scientific Computing Platform, Parallel Computing System, Distributed Resource Control System, HPC Resource Allocation System, Supercomputing Infrastructure, Job Queuing System.