In unsupervised runs such as batched HPC execution (or: runs that ideally should be unsupervised, but where I end up checking the status every other minute), the operator should not have to monitor a running job synchronously.
Under real-world conditions on HPC systems (varying network and filesystem load, OoM scenarios, changing software, etc.), it is not uncommon for a batched job to start outside of regular working hours and then hang until the walltime limit is reached, which can be costly.
We should establish a mechanism (in job scripts) that programmatically monitors the progress / health of a simulation and, if a configurable timeout is reached, aborts it: first with SIGTERM (to allow backtrace generation) and then with SIGKILL.
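A minimal sketch of the abort side, assuming a plain Python helper in the job script and a 60-second grace period (both are illustrative choices, not existing tooling):

```python
import os
import signal
import time

def abort_simulation(pid: int, grace_s: float = 60.0) -> None:
    """Send SIGTERM first (lets the application dump a backtrace / clean up),
    then SIGKILL if the process is still alive after the grace period."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_s
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)            # signal 0: existence check only
        except ProcessLookupError:
            return                     # exited on its own after SIGTERM
        time.sleep(1.0)
    try:
        os.kill(pid, signal.SIGKILL)   # hard stop after the grace period
    except ProcessLookupError:
        pass                           # raced with the process exiting
```

In a batch script this would typically be pointed at the PID of the launcher (e.g., mpirun/srun) rather than at individual ranks.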
Possible Implementations
A very simple implementation could be to write some kind of status (e.g., the current time) into a file (e.g., from the I/O processor) every time step. In the batch job, a single polling process could then check how long ago that file was last updated.
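A sketch of such a polling watchdog, assuming the simulation rewrites a `heartbeat.txt` file each step (file name, timeout, and polling interval are illustrative assumptions):

```python
import os
import signal
import sys
import time

HEARTBEAT_FILE = "heartbeat.txt"   # assumption: rewritten by the I/O processor each step
TIMEOUT_S = 15 * 60                # configurable stall timeout
POLL_S = 30                        # polling interval of the watchdog

def watch(sim_pid: int) -> None:
    started = time.time()
    while True:
        try:
            last_update = os.path.getmtime(HEARTBEAT_FILE)
        except FileNotFoundError:
            last_update = started  # no heartbeat yet: count from watchdog start
        if time.time() - last_update > TIMEOUT_S:
            print(f"watchdog: no progress for >{TIMEOUT_S}s, "
                  f"sending SIGTERM to PID {sim_pid}", file=sys.stderr)
            os.kill(sim_pid, signal.SIGTERM)
            # ...then escalate to SIGKILL after a grace period,
            # as in the abort sketch above
            return
        time.sleep(POLL_S)

if __name__ == "__main__":
    watch(int(sys.argv[1]))        # e.g. the PID of the mpirun/srun launcher
```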
File-based I/O is of course far from ideal, e.g., due to sync, load, short time steps, or for I/O-free runs (e.g., optimization). A better approach might be to keep a port open for health queries (which could later be reused to query things like per-MPI-process memory usage, load, etc.) or to react to a POSIX signal by printing a status line on a specific channel (e.g., stderr), like dd does.
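A sketch of the signal-based variant, assuming a SIGUSR1 handler that prints a one-line status to stderr (similar in spirit to dd's SIGUSR1 behavior); field names are assumptions, and a real MPI code would likely install this on one designated rank only:

```python
import os
import signal
import sys
import time

current_step = 0                   # updated by the main (time-step) loop

def report_status(signum, frame):
    # Python delivers this in the main thread between bytecodes; keep it short.
    print(f"[health] pid={os.getpid()} step={current_step} "
          f"wallclock={time.strftime('%H:%M:%S')}",
          file=sys.stderr, flush=True)

signal.signal(signal.SIGUSR1, report_status)

# stand-in for the simulation's time-step loop
while True:
    current_step += 1
    time.sleep(1.0)                # placeholder for one step of real work
```

From the job script, a health probe would then be as simple as `kill -USR1 <pid>`, followed by a timeout if no status line shows up on stderr within the configured window.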