HPC: Monitoring Progress & Aborting #5584

ax3l · 2025-01-21T23:28:41Z

In unsupervised runs such as batched HPC execution (ideally they should be unsupervised, but in practice I am checking the status every other minute), the operator should not have to monitor a running job synchronously.

Under real-world conditions on HPC systems (changing loads on networks and filesystems, OoM scenarios, changing software, etc.), it is not uncommon that a batched job starts up outside of regular working hours and, in some cases, hangs until the walltime limit is reached. Such hangs can be costly.

We should establish a mechanism (in job scripts) that programmatically monitors the progress / health of a simulation and, if a configurable timeout is reached without progress, aborts the simulation: first with SIGTERM (for backtrace generation) and then with SIGKILL.
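As a rough illustration of the SIGTERM-then-SIGKILL escalation, here is a minimal Python sketch. The helper name `abort_gracefully` and the `grace_seconds` parameter are hypothetical and not part of any existing job script; the sketch also assumes the watchdog launched the simulation itself via `subprocess.Popen` on a POSIX system.

```python
import signal
import subprocess
import sys


def abort_gracefully(proc: subprocess.Popen, grace_seconds: float = 30.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX: gives the application a chance to dump a backtrace
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a hung simulation: a child process that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(600)"])
    code = abort_gracefully(child, grace_seconds=5.0)
    if code < 0:
        # On POSIX, a negative return code is the negated signal number.
        print("terminated by", signal.Signals(-code).name)
```

The grace period is what leaves room for backtrace generation; how long it needs to be likely depends on the size of the job and is left as a configurable assumption here.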

Possible Implementations

A very simple implementation could be to write some kind of status (e.g., the current time) into a file (e.g., from the I/O processor) every time step. In the batch job, a single polling process could then check how much time has passed since the last update.
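The file-based variant could look roughly like the following sketch. All names (`write_heartbeat`, `heartbeat_age`, `is_stalled`) and the heartbeat file layout (a single Unix timestamp) are assumptions made for illustration; in a real simulation code the writer side would live in the application, not in Python.

```python
import os
import tempfile
import time
from typing import Optional


def write_heartbeat(path: str) -> None:
    """Simulation side: record the current time, e.g. once per time step."""
    tmp = f"{path}.tmp"
    with open(tmp, "w") as f:
        f.write(f"{time.time():.3f}\n")
    os.replace(tmp, path)  # atomic rename: the poller never reads a partial file


def heartbeat_age(path: str, now: Optional[float] = None) -> float:
    """Watchdog side: seconds since the last heartbeat (inf if none is readable)."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            return now - float(f.read())
    except (FileNotFoundError, ValueError):
        return float("inf")


def is_stalled(path: str, timeout: float, now: Optional[float] = None) -> bool:
    """True if the last heartbeat is older than the configured timeout."""
    return heartbeat_age(path, now) > timeout


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        hb = os.path.join(d, "heartbeat")
        write_heartbeat(hb)
        print("stalled now?", is_stalled(hb, timeout=300.0))
        print("stalled in 10 min?", is_stalled(hb, timeout=300.0, now=time.time() + 600))
```

Note that in this sketch a missing heartbeat file counts as stalled, so a polling loop would probably want a startup grace period before the first check. Comparing timestamps written on a compute node against the clock of the polling node also assumes the clocks are reasonably in sync.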

File-based I/O is of course far from ideal, e.g., due to sync, load, short time steps, or for I/O-free runs (e.g., optimization). A better option might be to keep a port open for health queries (which could later be reused to query things like memory usage per MPI process, load, etc.) or to react to a POSIX signal and print something on a specific channel (e.g., stderr), like dd does.
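The signal-based idea can be sketched in Python as follows, in the spirit of how GNU dd prints its statistics to stderr on SIGUSR1. The class `ProgressReporter`, its methods, and the choice of SIGUSR1 are illustrative assumptions; an actual implementation in the simulation code would install an equivalent handler in its own language.

```python
import os
import signal
import sys
import time


class ProgressReporter:
    """Prints the latest known progress when the process receives a signal."""

    def __init__(self, stream=sys.stderr):
        self.stream = stream
        self.step = 0
        self.last_update = time.time()

    def update(self, step: int) -> None:
        """Called by the main loop after each time step; no I/O happens here."""
        self.step = step
        self.last_update = time.time()

    def handle(self, signum, frame) -> None:
        """Signal handler: report the step and the seconds since it finished."""
        age = time.time() - self.last_update
        print(f"step={self.step} seconds_since_update={age:.1f}",
              file=self.stream, flush=True)

    def install(self, signum=signal.SIGUSR1) -> None:
        """Register the handler for the given signal (POSIX only)."""
        signal.signal(signum, self.handle)


if __name__ == "__main__":
    reporter = ProgressReporter()
    reporter.install()
    reporter.update(step=42)
    # Same effect as `kill -USR1 <pid>` issued from the job script.
    os.kill(os.getpid(), signal.SIGUSR1)
    time.sleep(0.1)  # give the interpreter a chance to run the handler
```

This keeps the per-step cost to two assignments and only does I/O on request. One caveat of the Python version specifically: handlers run between bytecode instructions in the main thread, so a process stuck inside a long native call may not answer until that call returns.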

ax3l added the "machine / system" label (Machine or system-specific issue) on Jan 21, 2025
