Slurm heterogeneous job support in Cylc

Hello again,

[edit] I misunderstood how the jobs are run: they are launched via srun from the lead job rather than each sharing the same script, which means this is simpler than I thought.

I’ve now had a trawl through the Slurm documentation and a quick think, and I think this is something we could support in a future release of Cylc8 (and potentially Cylc7).

I think Cylc should be OK with this because, as far as I know, this is all treated as a single Slurm job.

What Slurm refers to as a “single job” in this context is a “meta job” which represents a collection of “subjobs”, each with its own heterogeneous ID. It looks like the ID of this meta job can be used for monitoring and control just like any other job ID, which makes things nice and easy for Cylc.

The remaining issue is that the job script that Cylc submits to Slurm contains logic for tracking the job through its lifetime, trapping errors, returning statuses, etc. This system is independent of Slurm and universal for all batch systems Cylc supports. When each of the subjobs in the heterogeneous submission runs, it will try to contact the Cylc scheduler with updates about its progress and will fight over the job status file. These updates will clash, causing Cylc to report erroneous task/job states, and could potentially corrupt the status file.

I think this is something that we could support if we make the first subjob the “lead subjob” and disable Cylc’s job tracking in the remaining subjobs based on the value of Slurm environment variables.

This means that from the perspective of Cylc the whole “meta job” will succeed or fail when the first subjob succeeds or fails. This would require some minor alterations to the Cylc jobscript but should be simple.
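As a rough illustration of the lead-subjob idea (a sketch only, not the actual Cylc job script, which is a shell script), the check could look something like the following. The variable name `HET_COMPONENT_INDEX` is a placeholder I have made up for the example; it would need to be replaced with whichever environment variable Slurm actually sets to identify the heterogeneous component, and the function name `should_track_job` is likewise hypothetical.

```python
import os
from typing import Mapping

# Placeholder name, NOT a real Slurm variable: stands in for whatever
# Slurm sets to identify the heterogeneous component a process runs in.
HET_COMPONENT_VAR = "HET_COMPONENT_INDEX"


def should_track_job(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if this process should run Cylc's job tracking.

    Tracking stays on when the variable is absent (an ordinary,
    non-heterogeneous job) or when it identifies component 0 (the
    assumed "lead subjob"); it is switched off for every other component.
    """
    value = env.get(HET_COMPONENT_VAR)
    return value is None or value.strip() == "0"


if __name__ == "__main__":
    if should_track_job():
        print("lead subjob: job tracking enabled")
    else:
        print("non-lead subjob: job tracking disabled")
```

With a check like this only one subjob would ever message the scheduler or write the job status file, which is what ties the state of the whole “meta job” to the first subjob.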

Tim has opened an issue on the Cylc Flow repository, and I’ve dumped this implementation outline on that issue - Support Slurm Heterogenous Jobs · Issue #3964 · cylc/cylc-flow · GitHub

Sadly we are all currently engaged in delivering Cylc8, so we aren’t going to be able to commit time to this right now, and I’ve pinned it against the “some day” milestone. However, we do welcome contributions and have received batch system support changes from Cylc sites in the past.

Oliver