Slurm heterogeneous job support in cylc

Hi

I am trying to use Slurm heterogeneous jobs (https://slurm.schedmd.com/heterogeneous_jobs.html) for running MPMD jobs. This involves batch directives like

#SBATCH --partition=standard
#SBATCH --nodes=28
#SBATCH --ntasks=1782
#SBATCH --tasks-per-node=64
#SBATCH --cpus-per-task=2
#SBATCH hetjob
#SBATCH --partition=standard
#SBATCH --nodes=7
#SBATCH --ntasks=864
#SBATCH --tasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH hetjob
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=1

When you put these in a cylc [[[directives]]] section, Cylc keeps only the last occurrence of each repeated option and discards the earlier ones.

Is there any way to do this in cylc currently?

Thanks.

Jeff.

Hi Jeff

I’m not sure that you can do this internally in Cylc right now: I think it would require some reworking to the way configurations are parsed to allow this to happen - I think it’s a pretty fundamental assumption at the moment that if you have the same config key a second time it over-writes the first one.
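To illustrate the overwrite behaviour described above, here is a minimal Python sketch. It is not Cylc's actual parser; `parse_directives` is a hypothetical helper that stores each `key = value` line in a plain dictionary, which is enough to show why a repeated key loses its earlier value.

```python
def parse_directives(lines):
    """Parse 'key = value' lines into a dict (hypothetical helper).

    A repeated key replaces the value stored earlier, so only the
    last occurrence survives.
    """
    directives = {}
    for line in lines:
        key, _, value = line.partition("=")
        directives[key.strip()] = value.strip()
    return directives


# Two het-job components written with the same option names.
lines = [
    "--nodes = 28",
    "--ntasks = 1782",
    "--nodes = 7",
    "--ntasks = 864",
]
print(parse_directives(lines))
# prints {'--nodes': '7', '--ntasks': '864'}
```

Only the second component's values remain, which matches what Jeff reported.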

It doesn’t stop you writing your own job scripts to be run by Cylc, but that’s what Cylc is meant to do for you, so it doesn’t sit well with me to say that.

:frowning:

I have created an issue at https://github.com/cylc/cylc-flow/issues/3964

Tim

Hello,

Sadly I don’t know much about heterogeneous job submission. Do the subjobs all start at the same time from a heterogeneous submission?

What is the problem you are trying to solve using them:

  1. Are you trying to submit multiple independent jobs in a single submission for efficiency?
  2. Or are these sub-jobs part of an assemblage of inter-dependent parts running across multiple nodes (like you might configure on cloud systems like AWS)?

If the answer is (1), then this is the sort of thing you can get Cylc to do for you rather than having to configure yourself with a specific batch submission system. I would suggest using separate Cylc tasks. This is much more powerful as you can leverage the task model to configure inter-dependence, retries, view progress in the GUI and all the other goodies Cylc has to offer.

If the answer is (2) then I’m afraid this is beyond the scope of Cylc (at least for now) as Cylc sees a job as being the lowest-level component of a workflow. Jobs are atomic and cannot be split into sub-jobs. If you managed to get the Cylc Slurm module to submit a heterogeneous job, Cylc would get confused by the different sub-jobs messaging back their statuses and the model would break.

As an aside: we have some thoughts and plans on making the task-job model a little more flexible in Cylc9 (or perhaps later Cylc8 versions) to allow things like batching multiple tasks into a single job submission. This would reduce the queue time and job-submission overheads for small jobs whilst preserving task-level granularity.

Oliver

Hi

Thanks Tim for creating the github issue.

I’m using these options to run an MPMD job, in this case a coupled run with separate executables for atmosphere, ocean and IO server. These executables run at the same time and share an MPI communicator. This is just how you would run on other systems, except that Slurm uses the batch directives to allocate resources to each executable, or hetjob. The job flow is

uan01:jwc$ sbatch coupled_run_hetjob.job
Submitted batch job 50547
uan01:jwc$ squeue -j 50547
JOBID    PARTITION  NAME      USER  ST  TIME  NODES  NODELIST(REASON)
50547+0  standard   coupled_  jwc   R   4:57  28     nid[001063-001080,001401-001410]
50547+1  standard   coupled_  jwc   R   4:57  7      nid[001002-001005,001411-001413]
50547+2  standard   coupled_  jwc   R   4:57  1      nid001414

In the batch script the job is run like this

srun --het-group=0 --hint=nomultithread --distribution=block:block --export=all,OMP_NUM_THREADS=2,OMP_PLACES=cores toyatm : --het-group=1 --cpu-bind=cores --export=all,OMP_NUM_THREADS=1 toyoce : --het-group=2 --cpu-bind=cores --export=all,OMP_NUM_THREADS=1 xios.x
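The structure of that command is one argument list per het group, joined by " : ". As a rough sketch of how it could be assembled programmatically, the hypothetical Python helper below (`build_srun_command` is my own name, not part of Slurm or Cylc) numbers the groups in order and uses only the options that appear in the command above.

```python
def build_srun_command(components):
    """Build a colon-separated MPMD srun command (hypothetical helper).

    components: list of (options, executable) pairs, one per het group,
    given in het-group order.
    """
    parts = []
    for group, (options, executable) in enumerate(components):
        args = [f"--het-group={group}", *options, executable]
        parts.append(" ".join(args))
    return "srun " + " : ".join(parts)


components = [
    (["--hint=nomultithread", "--distribution=block:block",
      "--export=all,OMP_NUM_THREADS=2,OMP_PLACES=cores"], "toyatm"),
    (["--cpu-bind=cores", "--export=all,OMP_NUM_THREADS=1"], "toyoce"),
    (["--cpu-bind=cores", "--export=all,OMP_NUM_THREADS=1"], "xios.x"),
]
print(build_srun_command(components))
```

With these three components the helper reproduces the srun line shown above.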

I think cylc should be ok with this because as far as I know this is all treated as a single Slurm job.

Thanks for your interest in this matter.

Jeff.

Hello again,

[edit] I had misunderstood how the jobs are run (via srun from the lead job, rather than each subjob running the same script), which means this is simpler than I thought.

I’ve now had a trawl through the Slurm documentation and a quick think, and I believe this is something we could support in a future release of Cylc8 (and potentially Cylc7).

Jeff wrote: "I think cylc should be ok with this because as far as I know this is all treated as a single slurm job."

What Slurm refers to as a “single job” in this context is a “meta job” which represents a collection of “subjobs” each with their own heterogeneous ID. It looks like this meta job can be used for monitoring and control just like any other job ID which makes things nice and easy for Cylc.

The remaining issue is that the job script that Cylc submits to Slurm contains logic for tracking the job through its lifetime, trapping errors, returning statuses, etc. This system is independent of Slurm and universal for all batch systems Cylc supports. When each of the subjobs in the heterogeneous submission runs, it will try to contact the Cylc scheduler with updates about its progress, and the subjobs will fight over the job status file. These updates will clash, causing Cylc to give erroneous task/job states, and could potentially corrupt the status file.

I think this is something that we could support if we make the first subjob the “lead subjob” and disable Cylc’s job tracking in the remaining subjobs based on the value of Slurm environment variables.

This means that from the perspective of Cylc the whole “meta job” will succeed or fail when the first subjob succeeds or fails. This would require some minor alterations to the Cylc jobscript but should be simple.

Tim has opened an issue on the Cylc Flow repository, and I’ve dumped this implementation outline on it: https://github.com/cylc/cylc-flow/issues/3964#issuecomment-734361717

Sadly we are all currently engaged in delivering Cylc8, so we can’t commit time to this right now and I’ve pinned it against the “some day” milestone. However, we do welcome contributions and have received batch system support changes from Cylc sites in the past.

Oliver

The workaround given in the GitHub issue has enabled me to run an MPMD coupled model using Slurm heterogeneous jobs. Thanks.

For the record, in case others are following this thread, the concerns outlined above thankfully turned out not to be a problem. The Slurm documentation is woefully thin on how heterogeneous jobs work, but fortunately Slurm does not run the whole batch script for each job component, so there is no problem with job status messages in Cylc. We just had to solve the much easier problem of repeated Slurm directives.

So, heterogeneous Slurm jobs are supported in the latest Cylc release, cylc-7.9.2 (out last week), and documented in the Cylc User Guide.
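As a rough illustration of how repeated directives can be told apart, the Python sketch below gives each option a per-component key prefix and expands the result into #SBATCH lines separated by "#SBATCH hetjob". The `hetjob_<n>_` prefix and the `render_sbatch_lines` helper are assumptions made for this sketch; the Cylc User Guide is the authority on the real syntax.

```python
import re

# Hypothetical key prefix marking which het-job component a directive
# belongs to; an assumption for this sketch, not confirmed Cylc syntax.
HETJOB_KEY = re.compile(r"^hetjob_(\d+)_(.+)$")


def render_sbatch_lines(directives):
    """Turn prefixed directives into #SBATCH lines split by 'hetjob'."""
    common = []   # directives without a prefix, emitted first
    groups = {}   # component index -> list of (option, value)
    for key, value in directives.items():
        match = HETJOB_KEY.match(key)
        if match:
            index = int(match.group(1))
            groups.setdefault(index, []).append((match.group(2), value))
        else:
            common.append((key, value))

    lines = [f"#SBATCH {option}={value}" for option, value in common]
    for position, index in enumerate(sorted(groups)):
        if position > 0:
            # Separator between components, as in the directives
            # at the top of this thread.
            lines.append("#SBATCH hetjob")
        lines.extend(
            f"#SBATCH {option}={value}" for option, value in groups[index]
        )
    return lines


directives = {
    "hetjob_0_--nodes": "28",
    "hetjob_0_--ntasks": "1782",
    "hetjob_1_--nodes": "7",
    "hetjob_1_--ntasks": "864",
}
print("\n".join(render_sbatch_lines(directives)))
```

Because every key is now unique, a dictionary-based configuration can hold all components without one overwriting another.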

Regards,
Hilary