Slurm heterogeneous job support in cylc

oliver.sanders · November 26, 2020, 3:50pm

Hello again,

[edit] I misunderstood how the jobs are run (via srun from the lead job, rather than each subjob sharing the same script), which means this is simpler than I thought.

I’ve now had a trawl through the Slurm documentation and a quick think, and I think this is something we could support in a future release of Cylc8 (and potentially Cylc7).

I think Cylc should be OK with this because, as far as I know, this is all treated as a single Slurm job.

What Slurm refers to as a “single job” in this context is a “meta job” which represents a collection of “subjobs” each with their own heterogeneous ID. It looks like this meta job can be used for monitoring and control just like any other job ID which makes things nice and easy for Cylc.

The remaining issue is that the job script that Cylc submits to Slurm contains logic for tracking the job through its lifetime, trapping errors, returning statuses, etc. This system is independent of Slurm and universal for all batch systems Cylc supports. When each of the subjobs in the heterogeneous submission runs, it will try to contact the Cylc scheduler with updates about its progress and will fight over the job status file. These updates will clash, causing Cylc to report erroneous task/job states, and could potentially corrupt the status file.

I think this is something that we could support if we make the first subjob the “lead subjob” and disable Cylc’s job tracking in the remaining subjobs based on the value of Slurm environment variables.

This means that from the perspective of Cylc the whole “meta job” will succeed or fail when the first subjob succeeds or fails. This would require some minor alterations to the Cylc jobscript but should be simple.
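The "lead subjob" idea above can be sketched roughly as follows. This is only an illustration in Python, not the Cylc jobscript itself (which is a shell script). The variable name `EXAMPLE_HET_SUBJOB_INDEX` is a hypothetical placeholder for whatever environment variable Slurm uses to expose a subjob's index, and `is_lead_subjob` and `report_status` are made-up helpers, so check the Slurm documentation for your version before relying on any of it.

```python
from typing import Callable, Mapping

# Hypothetical placeholder name: substitute the environment variable that
# your Slurm version actually sets to identify a subjob (assumption).
SUBJOB_INDEX_VAR = "EXAMPLE_HET_SUBJOB_INDEX"


def is_lead_subjob(env: Mapping[str, str],
                   index_var: str = SUBJOB_INDEX_VAR) -> bool:
    """Return True if this process should do Cylc job tracking."""
    raw = env.get(index_var)
    if raw is None:
        # Variable not set: treat as an ordinary (non-heterogeneous) job
        # and track it as usual.
        return True
    # Only the first subjob (index 0) is the lead subjob.
    return raw.strip() == "0"


def report_status(message: str,
                  env: Mapping[str, str],
                  send: Callable[[str], None]) -> None:
    """Send a status message only from the lead subjob."""
    if is_lead_subjob(env):
        send(message)
```

In real use `env` would be `os.environ` and `send` would be whatever mechanism writes the status file or messages the scheduler. Subjobs with a non-zero index stay silent, so the status of the whole meta job follows the first subjob.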

Tim has opened an issue on the Cylc Flow repository, and I’ve dumped this implementation outline on that issue - Support Slurm Heterogenous Jobs · Issue #3964 · cylc/cylc-flow · GitHub

Sadly, we are all currently engaged in delivering Cylc8, so we aren’t going to be able to commit time to this right now, and I’ve pinned it against the “some day” milestone. However, we do welcome contributions and have received batch system support changes from Cylc sites in the past.

Oliver
