Simple Linux Utility for Resource Management (SLURM)

EDUCADO-MWGaiaDN Training School on Astro-AI and Machine Learning

Bernd Doser

Heidelberg Institute for Theoretical Studies (HITS)

March 5, 2026

This workshop is available at https://github.com/BerndDoser/EDUCADO_Slurm_Workshop

Agenda

  • Slurm Basics
    • Understanding HPC Clusters, Partitions and Resources
    • Submitting Jobs and Monitoring Progress
    • Requesting resources (CPUs, memory, time)
  • MPI with Slurm
    • Parallel jobs across multiple nodes
  • Slurm with StreamFlow
    • Heterogeneous jobs with different resource requirements

Why HPC?

  • Hundreds to thousands of CPU/GPU cores
  • Fast interconnects (InfiniBand up to ~400 Gb/s, NVLink up to ~1.8 TB/s, etc.)
  • Shared, managed storage (~9 PB @ HITS, 417 TB GPUDirect Storage)
  • High-performance software stack (MPI, CUDA, optimized libraries)
  • Job scheduling and resource management (SLURM, PBS, etc.)
  • Access to specialized hardware (FPGAs, TPUs, etc.)

HITS Cascade Cluster (used for hands-on session)

The Cascade cluster provides over 3,000 CPU cores, which can be used for general-purpose parallel workloads and CPU-bound applications.

Its 300 NVIDIA RTX 2080 Super GPUs make it well suited for molecular dynamics simulations, machine learning workflows, and large-scale linear algebra computations.

HITS Genoa Cluster

The Genoa cluster represents the next generation of high-performance computing. Equipped with state-of-the-art AMD EPYC Genoa, Bergamo, and Rome processors as well as the latest NVIDIA H200 and A100 GPUs, it provides a versatile platform for both CPU-intensive and GPU-accelerated research workloads.

What is SLURM?

SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager designed for Linux clusters of all sizes.

Yoo, Jette, and Grondona (2003), Jette and Wickberg (2023)

  • Keeps track of all nodes, CPUs, GPUs, and memory
  • Enforces fair-share policies and quotas
  • Accepts job submissions from users (sbatch, srun, salloc)
  • Queues jobs and dispatches them when resources are free
  • Supports priorities, reservations, and backfill scheduling

SLURM is now part of NVIDIA’s HPC software stack (December 2025).

Slurm Architecture

%%{ init: { "themeVariables": { "fontSize": "12px" }, "flowchart": { "nodeSpacing": 30, "rankSpacing": 60 } } }%%
flowchart LR
    U(["👤 User"]) -- sbatch / srun --> LC["Login Node<br/>slurmctld"]
    LC -- dispatches --> N1["Compute Node<br/>slurmd"]
    LC -- dispatches --> N2["Compute Node<br/>slurmd"]
    LC -- dispatches --> N3["Compute Node<br/>slurmd"]
    DB[("SlurmDB<br/>Accounting")] <--> LC

  • Login node — where users connect via SSH and submit jobs; never run heavy work here
  • Compute nodes — where jobs actually execute; managed by the slurmd daemon
  • SlurmDB — optional accounting database for job history and fairshare
  • slurmctld — central controller that manages job scheduling and resource allocation

The Typical SLURM Workflow

  1. SSH into the login node
  2. Prepare your script or application
  3. Write a batch script with #SBATCH resource directives
  4. Submit with sbatch job.sh → receive a job ID
  5. Monitor with squeue, sinfo, scontrol show job <id>
  6. Retrieve results from output files once the job finishes
  7. Debug failures with log files and sacct
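The steps above can be sketched as a minimal shell session. The job name and commands are illustrative; the Slurm commands themselves only work on a cluster login node, so they are shown as comments:

```shell
# Step 3: write a batch script with #SBATCH directives
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --time=00:05:00
#SBATCH --output=demo-%j.out
echo "Hello from $(hostname)"
EOF

# Steps 4-7 (run on the login node):
# sbatch job.sh                          # -> Submitted batch job <id>
# squeue --me                            # watch the queue
# cat demo-<id>.out                      # inspect results once finished
# sacct -j <id> --format=JobID,State     # debug after completion
```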

sinfo: Cluster status

[stud99@cascade-login ~]$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
cascade.p*    up 1-00:00:00      1  maint cascade-003
cascade.p*    up 1-00:00:00      3 drain* cascade-[012,028,030]
cascade.p*    up 1-00:00:00      7    mix cascade-[008,027,042,044,054-055,062]
cascade.p*    up 1-00:00:00    111  alloc cascade-[001-002,004-007,009-011,018-026,029,031-039,041,045,049-051,053,057,065-089,092,097-144,147-148]
cascade.p*    up 1-00:00:00     26   idle cascade-[013-017,040,043,046-048,052,056,058-061,063-064,090-091,093-096,145-146]
debug.p       up    1:00:00      2   idle cascade-[149-150]
karl.p        up 7-00:00:00      1    mix karl

The STATE column indicates whether nodes are idle, allocated, in maintenance, or in a mixed state (partially allocated).

sbatch: Defining a Job

job.sh
#!/bin/bash
#SBATCH --job-name=test              # Job name
#SBATCH --partition=cascade.p        # Partition to run on
#SBATCH --nodes=1                    # Number of nodes
#SBATCH --ntasks=1                   # Total number of tasks
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --mem=8G                     # Memory per node
#SBATCH --time=00:30:00              # Walltime (hh:mm:ss)
#SBATCH --output=test-%j.out         # Standard output (%j = job ID)

# run your command
echo "Hello from node $(hostname)"
  • #SBATCH directives to specify job parameters
  • Slurm will allocate the appropriate resources and execute the job

sbatch: Submit a Job

The batch script can be submitted to the Slurm scheduler using the sbatch command:

sbatch job.sh

which prints a confirmation message

Submitted batch job 18491592

containing the job ID (JOBID), which can be used to monitor or cancel the job.

squeue: Monitor the Job Queue

[stud99@cascade-login ~]$ squeue --me
JOBID        PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
18491592   cascade.p  hello_wo  stud99  R       0:11      1 cascade-029
18491601   cascade.p  array_jo  stud99 PD       0:00      1 (Priority)

  • JOBID: unique job identifier
  • ST: job state (R = Running, PD = Pending, CG = Completing)
  • TIME: elapsed run time
  • NODELIST(REASON): node(s) assigned, or the reason the job is pending

Useful options: --me (own jobs), -u <user>, -p <partition>, --start (estimated start time)

scontrol: Monitor Job Status

scontrol can be used to obtain more detailed information about a job:

scontrol show job <job-id>

Example output:

JobId=18491592 JobName=hello_world
   UserId=stud99(20199) GroupId=aliens(20000) MCS_label=N/A
   Priority=3865470565 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:11 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2026-02-20T14:51:17 EligibleTime=2026-02-20T14:51:17
   AccrueTime=2026-02-20T14:51:17
   StartTime=2026-02-20T14:51:18 EndTime=2026-02-20T14:56:18 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-02-20T14:51:18 Scheduler=Main
   Partition=cascade.p AllocNode:Sid=cascade-login:1286819
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cascade-029
   BatchHost=cascade-029
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=100M,node=1,billing=1
   AllocTRES=cpu=2,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=100M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/stud99/EDUCADO_Slurm_Workshop/hands-on/exercise_1/hello.sh
   WorkDir=/home/stud99/EDUCADO_Slurm_Workshop/hands-on/exercise_1
   StdErr=/home/stud99/EDUCADO_Slurm_Workshop/hands-on/exercise_1/hello_18491592.out
   StdIn=/dev/null
   StdOut=/home/stud99/EDUCADO_Slurm_Workshop/hands-on/exercise_1/hello_18491592.out
   Power=

srun: Interactive Mode

Allocates a node and drops you into an interactive shell — useful for debugging, testing, or exploring the environment.

srun --partition=cascade.p --time=01:00:00 --pty bash

Common options:

  • --pty bash: open an interactive terminal
  • --cpus-per-task=4: request multiple CPUs
  • --mem=8G: request memory
  • --gres=gpu:1: request a GPU
  • --x11: enable X11 forwarding

Slurm Tiny Helper

  • Cancel a running job: scancel <jobid>

  • Check past jobs (if available): sacct -u $USER --starttime=today

  • Email notifications:

    #SBATCH --mail-type=END  # or ALL
    #SBATCH --mail-user=your@email.com
  • Job dependency: Submit two jobs where the second only runs after the first succeeds:

    JOB1=$(sbatch --parsable hello.sh)
    sbatch --dependency=afterok:$JOB1 hello.sh

Fairshare

Fairshare ensures equitable cluster access by prioritizing users who have used fewer resources recently.

\[ \text{Priority} = w_{fs} \cdot F + w_{age} \cdot A + w_{qos} \cdot Q + \ldots \]

  • Fairshare score \(F \in [0, 1]\): closer to 1 means higher priority (less recent usage)
  • Score decays over time — past usage matters less as time passes
# View your fairshare score
sshare -u $USER

# See job priority breakdown for pending jobs
sprio -l -u $USER
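To make the formula concrete, here is a toy priority calculation. The weights and factor values below are illustrative, not HITS values; the real weights can be inspected with scontrol show config | grep -i priorityweight:

```shell
# Toy priority computation with assumed weights and factors
w_fs=10000      # PriorityWeightFairshare (assumed)
w_age=1000      # PriorityWeightAge (assumed)
F=80            # fairshare factor, scaled to 0-100 here for readability
A=50            # age factor, same scale
priority=$(( w_fs * F + w_age * A ))
echo "$priority"    # 850000 -- fairshare dominates with these weights
```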

Submit jobs when your fairshare score is high to get shorter wait times.

Quality of Service (QoS)

Most clusters use a standard set of QoS levels:

  • debug: Very high priority, but very short walltime (e.g., 30 mins) and small node limits. Great for testing code before a big run.
  • normal: The default. Balanced priority and standard limits.
  • long: Lower priority, but allows jobs to run for 7+ days.
  • scavenger: Zero priority. Jobs only run if the cluster is empty, and they can be killed (preempted) the moment a “paying” job arrives.

To submit a job with a specific QoS:

sbatch --qos=high_priority my_script.sh

Why is my job still waiting?

High recent usage lowers your Fairshare score. Use sprio -l -u $USER to see where your job ranks among all pending jobs.

Job Arrays

If you have a large number of similar jobs (e.g., parameter sweeps), you can use job arrays to submit them efficiently.

#!/bin/bash
#SBATCH --array=0-9
#SBATCH --time=01:00:00
#SBATCH --job-name=array_job

echo "Running task $SLURM_ARRAY_TASK_ID"
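Each array task can select its own input via $SLURM_ARRAY_TASK_ID. A common pattern (the dataset names below are illustrative) is to index into a bash array:

```shell
#!/bin/bash
#SBATCH --array=0-4
#SBATCH --job-name=sweep
#SBATCH --output=sweep_%A_%a.out   # %A = array job ID, %a = task index

# Map the task index to a dataset (illustrative names)
DATASETS=(galaxy_a galaxy_b galaxy_c galaxy_d galaxy_e)
INPUT=${DATASETS[$SLURM_ARRAY_TASK_ID]}
echo "Task $SLURM_ARRAY_TASK_ID processes $INPUT"
```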

MPI with Slurm

#!/bin/bash
#SBATCH --nodes=2             # Request 2 nodes
#SBATCH --ntasks-per-node=4   # Run 4 tasks per node (total 8 tasks)
#SBATCH --time=1:00:00        # Set a time limit to 1 hour

module load openmpi           # Load your MPI module
srun ./my_mpi_program         # srun handles the launching

Use srun (instead of mpirun or mpiexec) to launch parallel tasks across nodes.

It directly communicates with Slurm to launch processes, propagates signals correctly (like cancelling a job), and ensures process binding respects your requests.

GPU Jobs with Slurm

GPUs are managed as generic resources (GRES) in Slurm.

#SBATCH --gres=gpu:a100:2       # 2× NVIDIA A100 per node
#SBATCH --gres=gpu:v100:1       # 1× NVIDIA V100 per node
#SBATCH --gres=gpu:rtx3090:4    # 4× RTX 3090 per node

Find available GPU types with

sinfo -o "%P %G" | column -t

Check actual GPU allocation inside your job

echo $CUDA_VISIBLE_DEVICES    # Slurm sets this automatically
nvidia-smi                    # shows only your allocated GPUs
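Since Slurm sets CUDA_VISIBLE_DEVICES to a comma-separated list of GPU indices, a script can count its allocated GPUs from that variable. The value below is hard-coded only to illustrate what Slurm would set for a two-GPU allocation:

```shell
# Example value as Slurm would set it for --gres=gpu:2
CUDA_VISIBLE_DEVICES="0,1"

# Count the comma-separated entries
ngpus=$(awk -F',' '{print NF}' <<< "$CUDA_VISIBLE_DEVICES")
echo "allocated GPUs: $ngpus"    # allocated GPUs: 2
```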

Heterogeneous Jobs

Sometimes a single job needs different resources for different parts (e.g., a master process on a CPU node and worker processes on GPU nodes).

Key Features:

  • Combine multiple job specifications into one valid job.
  • Each component (pack group) has its own resources (nodes, memory, partition).
  • Submitted as a single entity with one JOBID.
#!/bin/bash
#SBATCH --partition=genoa.p --nodes=2 --ntasks-per-node=32
#SBATCH hetjob
#SBATCH --partition=genoa-deep.p --nodes=1 --gpus=4

srun --het-group=0 ./preprocess
srun --het-group=1 ./train

Slurm REST API

The slurmrestd daemon exposes Slurm functionality over a REST API, enabling, for example:

  • Workflow managers and web portals submitting jobs on behalf of users
  • CI/CD pipelines triggering HPC jobs automatically
  • Monitoring dashboards polling job and node state
  • Python scripts submitting jobs directly

  • GET /slurm/v0.0.42/jobs: list all jobs
  • GET /slurm/v0.0.42/job/{job_id}: get details of a specific job
  • POST /slurm/v0.0.42/job/submit: submit a new job
  • DELETE /slurm/v0.0.42/job/{job_id}: cancel a job
  • GET /slurm/v0.0.42/nodes: list all nodes and their state
  • GET /slurm/v0.0.42/partitions: list partitions
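A job submission via the REST API can be sketched with curl. The server URL, port, and JSON field values are assumptions for illustration; authentication typically uses a JWT, which can be generated with scontrol token on clusters configured for it:

```shell
# Build a submission payload (field names follow the v0.0.42 job submit schema;
# the partition and script contents are illustrative)
cat > payload.json <<'EOF'
{
  "job": {
    "name": "rest_test",
    "partition": "cascade.p",
    "current_working_directory": "/tmp",
    "environment": ["PATH=/bin:/usr/bin"]
  },
  "script": "#!/bin/bash\necho hello"
}
EOF

# Submit (URL and token are placeholders; requires a running slurmrestd):
# curl -s -X POST http://slurmrestd:6820/slurm/v0.0.42/job/submit \
#      -H "X-SLURM-USER-NAME: $USER" \
#      -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
#      -H "Content-Type: application/json" \
#      -d @payload.json
```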

Slurm Reservation

A reservation pre-allocates cluster resources (nodes, CPUs, time window) for exclusive use by a specific set of users or accounts.

scontrol show reservation  # List all reservations

Our reservation of 10 nodes (200 cores):

ReservationName=aliens_99 StartTime=2026-03-05T06:00:00 EndTime=2026-03-06T06:00:00 Duration=1-00:00:00
   Nodes=cascade-[001-010] NodeCnt=10 CoreCnt=200 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=400
   Users=(null) Groups=aliens Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Users submit jobs into a reservation with:

sbatch --reservation=aliens_99 job.sh

Slurm with Containers

  • Ship complex software stacks in a portable container image
  • Enable reproducible research and portable workflows
  • Guarantee identical environments across development and production

Apptainer is the de facto standard container runtime on HPC clusters.

#!/bin/bash
#SBATCH --job-name=container_job
#SBATCH --output=container_%j.out
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

module load apptainer

apptainer exec --bind /data:/data my_image.sif python3 /app/run.py
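An image like my_image.sif can be built from an Apptainer definition file. The sketch below uses an illustrative base image and package; the build step requires Apptainer and is therefore shown as a comment:

```shell
# Write a minimal Apptainer definition file (base image and package are illustrative)
cat > my_image.def <<'EOF'
Bootstrap: docker
From: python:3.12-slim

%post
    # runs once at build time, inside the container
    pip install --no-cache-dir numpy

%runscript
    # default action for `apptainer run`
    exec python3 "$@"
EOF

# Build and test the image (requires apptainer):
# apptainer build my_image.sif my_image.def
# apptainer exec my_image.sif python3 -c "import numpy; print(numpy.__version__)"
```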

StreamFlow

  • StreamFlow is a general framework for workflow orchestration
  • Relies on the Common Workflow Language (CWL)
    and connects the CWL with a deployment model
  • Workflows can be deployed to different environments:
    • Slurm
    • Kubernetes
    • Containers / Apptainer
  • Designed for heterogeneous jobs with varying resource requirements
  • Example for astronomy data processing pipelines: LOFAR-VLBI Pilot

Common Workflow Language (CWL)

workflow.cwl
cwlVersion: v1.2
class: Workflow

inputs:
  epochs: int
  train_dataset: Directory
  eval_dataset: Directory
  train_script: File
  eval_script: File

outputs:
  model:
    type: File
    outputSource: train/model
  accuracy:
    type: string
    outputSource: eval/accuracy

steps:
  train:
    run: train.cwl
    in:
      train_script: train_script
      epochs: epochs
      dataset: train_dataset
    out:
      - model
      - loss
  evaluate:
    run: eval.cwl
    in:
      eval_script: eval_script
      dataset: eval_dataset
      model: train/model
    out:
      - accuracy
  • Declare the workflow in a YAML file
    • inputs
    • outputs
    • steps
  • Inputs and outputs are passed between steps and must be compatible
  • Workflows can be shared and reused
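The concrete input values are supplied in a separate YAML settings file, referenced as cwl/settings.yml in the deployment example. A sketch with illustrative paths and values:

```yaml
# settings.yml -- input object for workflow.cwl (values are illustrative)
epochs: 10
train_dataset:
  class: Directory
  path: data/train
eval_dataset:
  class: Directory
  path: data/eval
train_script:
  class: File
  path: scripts/train.py
eval_script:
  class: File
  path: scripts/eval.py
```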

StreamFlow Architecture

StreamFlow Deployment

StreamFlow connects the CWL workflow to a deployment model

workflows:
  workflow1:
    type: cwl
    config:
      file: cwl/workflow.cwl
      settings: cwl/settings.yml
    bindings:
      - step: /train
        target:
          deployment: slurm
          service: genoa-hopper-gpu
      - step: /eval
        target:
          deployment: slurm
          service: cascade
deployments:
  slurm:
    type: slurm
    config:
      maxConcurrentJobs: 10
      services:
        cascade:
          partition: cascade.p
          nodes: "1"
          mem: 12gb
        genoa-hopper-gpu:
          partition: genoa-hopper.p
          nodes: "1"
          gpus: "h200:2"
          mem: 96gb

HPC vs AI/ML Infrastructure


HPC workload

  • Simulations, modeling, and data analysis

  • Large clusters of CPUs and high-speed interconnects

  • Parallel computing (MPI)

  • Scheduling with Slurm

  • Bare-metal compilation

  • Distributed file storage

AI/ML workload

  • Training and inference of models

  • Data-intensive tasks

  • Massive numbers of matrix operations require accelerators (GPUs, TPUs, ...)

  • Cloud-native deployment

  • Kubernetes and containers

  • Object storage

Hardware and software requirements differ significantly between the two, reflecting the nature of their workloads and infrastructure.

Summary

  • SLURM is a widely used job scheduler that manages resource allocation and job scheduling on HPC clusters.
  • Understanding the architecture of SLURM and how to submit and monitor jobs is essential for effectively utilizing HPC resources.
  • Containers / Apptainer can be used to create reproducible environments for HPC applications.
  • StreamFlow provides a framework for managing heterogeneous jobs with varying resource requirements.
  • HPC and AI/ML workloads have different hardware and software requirements → Slinky

Thank you for your attention!

References

Jette, Morris A., and Tim Wickberg. 2023. “Architecture of the Slurm Workload Manager.” In Job Scheduling Strategies for Parallel Processing, edited by Dalibor Klusáček, Julita Corbalán, and Gonzalo P. Rodrigo, 3–23. Cham: Springer Nature Switzerland.
Yoo, Andy B., Morris A. Jette, and Mark Grondona. 2003. “SLURM: Simple Linux Utility for Resource Management.” In Job Scheduling Strategies for Parallel Processing, edited by Dror Feitelson, Larry Rudolph, and Uwe Schwiegelshohn, 44–60. Berlin, Heidelberg: Springer Berlin Heidelberg.

Hands-on session

Setup

  • Log in to the HITS Cascade cluster using the following command:
ssh -i <path_to_your_private_key> <username>@cascade-extern.h-its.org

The username studXX and the private key were provided via email.


  • Clone the Workshop Repository and navigate to the hands-on directory:
git clone https://github.com/BerndDoser/EDUCADO_Slurm_Workshop.git
cd EDUCADO_Slurm_Workshop/hands-on

Exercise 1: Explore the cluster

Run the following commands and answer the questions below:

sinfo
squeue
scontrol show partition

Questions:

  • How many partitions does the cluster have? What are their names?
  • How many nodes are in each partition?
  • Are any jobs currently running? How can you tell?
  • What is the maximum walltime allowed on the default partition?

Try sinfo -N -l for a node-level view and scontrol show partition <name> for details on a specific partition.

Exercise 1: Explore the cluster - Solution

  • How many partitions does the cluster have? What are their names?

    cascade.p (default), debug.p, karl.p

  • How many nodes are in each partition?

    cascade.p: 148 nodes, debug.p: 2 nodes, karl.p: 1 node

  • Are any jobs currently running? How can you tell?

    The squeue command shows the jobs in the queue.

  • What is the maximum walltime allowed on the default partition?

    cascade.p has a maximum walltime of 24 hours. Default time limit is 2 hours.

Exercise 2: Write your first batch script

Submit and observe

sbatch --reservation=aliens_99 hello.sh

Now quickly (within 30 seconds!) run several times:

squeue -u $USER

Questions:

  • What is your job ID?
  • What state is your job in? (PD = pending, R = running, CG = completing)
  • On which node is it running?

After the job completes, inspect the output:

cat hello_<jobid>.out
  • Does the hostname match what squeue reported?

Exercise 3: Submit a buggy script

Try to submit it:

sbatch --reservation=aliens_99 buggy.sh

Questions:

  • Did the job submit successfully? What error did you get?
  • Fix it and resubmit. What happens now?
  • After it runs, check both .out and .err files. What error do you see?
  • What is the hidden integer?

Exercise 4: Experiment with resources

Submit the script resources.sh and use scontrol to inspect:

sbatch --reservation=aliens_99 resources.sh
scontrol show job <jobid>

Questions:

  • Find the NumCPUs, MinMemoryNode, and TimeLimit fields in the scontrol output. Do they match what you requested?
  • What happens if you request more memory than the node has? Try changing --mem=9999G and resubmit. What error do you get?
  • Request 1 and 2 GPUs and check the output of nvidia-smi in the job. Do you see the expected number of GPUs allocated to your job?

Exercise 4: Resources - Solution

scontrol output with --mem=512M:

RunTime=00:00:20 TimeLimit=00:02:00 TimeMin=N/A
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
MinCPUsNode=2 MinMemoryNode=512M MinTmpDiskNode=0

--mem=9999G:

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

nvidia-smi output with --gres=gpu:1:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 ...    On  |   00000000:17:00.0 Off |                  N/A |
| 30%   29C    P8             16W /  250W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Exercise 5: Job Arrays

You need to process 5 “datasets” (we’ll simulate them). Instead of submitting 5 separate jobs, you’ll use a job array.

Submit and monitor

sbatch --reservation=aliens_99 array_job.sh
squeue -u $USER

Questions:

  • How many jobs appear in squeue? What do the job IDs look like?
  • What is the difference between %A and %a in the output filename?
  • After all jobs complete, list the output files. Do you have 5 result files and 5 log files?

Exercise 5: Job Arrays - Solution

What is the difference between %A and %a in the output filename?

  • %A is the job array ID, which is the same for all jobs in the array.
  • %a is the job array index, which is unique for each job in the array

Exercise 6: MPI Job

You will compile and run a simple MPI Hello World program across multiple tasks.

Compile the program

cd hands-on/exercise_6
. cascade-gompi-2025b.sh        # Load OpenMPI
mpicc -o mpi_hello mpi_hello.c  # Compile

Submit the job

sbatch --reservation=aliens_99 mpi_job.sh

Questions:

  • How many lines appear in the output file? Does each MPI rank print its message?
  • Which ranks run on which node? Do all ranks share the same node or are they spread across nodes?
  • Change --ntasks to 8 and resubmit. How does the output change?

Exercise 6: MPI Job - Solution

Output:

Job ID:        18512769
Nodes:         cascade-[127-128]
Total tasks:   4

Hello from rank 0 of 4 on node cascade-127.cluster
Hello from rank 1 of 4 on node cascade-127.cluster
Hello from rank 3 of 4 on node cascade-128.cluster
Hello from rank 2 of 4 on node cascade-128.cluster

Exercise 7: StreamFlow

Prepare the virtual environment

module load Python
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Submit the job

streamflow run streamflow.yml

Questions:

  • Separate the two steps into two different services. What changes do you need to make in the workflow and deployment definitions?

Exercise 7: StreamFlow - Solution

workflows:
  ...
    bindings:
      - step: /say_hello
        target:
          deployment: slurm
          service: cascade
      - step: /train
        target:
          deployment: slurm
          service: cascade-gpu
deployments:
  slurm:
    ...
      services:
        cascade:
          partition: cascade.p
          nodes: "1"
          mem: 1gb
        cascade-gpu:
          partition: cascade.p
          nodes: "1"
          gpus: "1"
          mem: 1gb