HPC Job Requeuing

Incident Report for Dartmouth College

Resolved

Thank you
Posted Oct 14, 2024 - 13:37 EDT

Identified

We have identified the issue and applied an initial fix. We are continuing to monitor the situation.
Posted Oct 07, 2024 - 10:10 EDT

Investigating

We are currently experiencing repeated requeuing of Slurm jobs across multiple nodes in our computing cluster.

Symptoms: Jobs are failing with the status NODE_FAIL and are requeued to other nodes. This behavior has affected numerous jobs over the past month.

We are working with the vendor to remedy this issue.

If you experience job loss, please email research.computing@dartmouth.edu.
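
To check whether your own jobs were affected, one option is to query Slurm accounting for jobs that ended in NODE_FAIL. The following is a minimal sketch, not an official procedure: the helper names (parse_sacct, node_fail_jobs, query_sacct) and the start date are illustrative assumptions, and it assumes sacct is available on the login node and that job names do not contain the "|" character.

```python
import subprocess

# Columns requested from sacct, in order (names here are illustrative).
FIELDS = ("job_id", "name", "state", "nodes")


def parse_sacct(text):
    """Parse pipe-delimited, header-less sacct output into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != len(FIELDS):
            continue  # skip blank or malformed lines
        jobs.append(dict(zip(FIELDS, parts)))
    return jobs


def node_fail_jobs(jobs):
    """Keep only the jobs whose recorded state is NODE_FAIL."""
    return [job for job in jobs if job["state"].startswith("NODE_FAIL")]


def query_sacct(start="2024-09-01"):
    """Run sacct for the current user; the start date is an assumed default."""
    cmd = [
        "sacct",
        "-X",  # one line per job allocation, no job steps
        "-n",  # no header line
        "-P",  # pipe-delimited output
        "-S", start,
        "--state=NODE_FAIL",
        "--format=JobID,JobName,State,NodeList",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Example use on a cluster login node:
#   for job in node_fail_jobs(parse_sacct(query_sacct())):
#       print(job["job_id"], job["name"], job["nodes"])
```

Including the job IDs and node names from this output in your email may help with follow-up.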
Posted Sep 30, 2024 - 11:14 EDT
This incident affected: Research Computing (HPC).