How to configure a remote run host with multiple job runners

Hi there, we want to be able to run some small tasks on the remote run host in the background rather than with Slurm. For this we set up two separate platforms in global.cylc, each with a different job runner.
This seems to work fine, except that at clean time we always get an error message: Cylc cleans the first platform fine, but by the time it gets to the second one the directories are already gone, hence the error.

Hi,

We do this too. You just need to assign the two platforms the same install target so that Cylc knows they live on the same filesystem.

E.g., something along the lines of:

[platforms]
    [[HPC_bg]]
        job runner = background
        install target = HPC
    [[HPC]]
        job runner = slurm
        install target = HPC

Thanks! So what exactly is the difference between install target and hosts? We use hosts, but obviously the two settings are treated differently.

Cylc needs to run a variety of operations on job hosts, e.g.:

  • Job submission.
  • Job polling.
  • Job killing.
  • Cleaning.

Terminology:

  • A platform represents a collection of nodes unified behind a job runner (e.g. slurm).
  • Operations (e.g. job submission and polling) can be performed on any of the hosts on the platform.
  • Cylc will pick a host at random (e.g. to submit a job to). If that host is down, Cylc will pick another host from the list.
  • The install target of the platform is the filesystem where the cylc-run directory is created (which may be symlinked onto another filesystem using “symlink dirs”).
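
Because the install target names a filesystem, it is also the key used by the “symlink dirs” configuration. A minimal sketch, assuming the HPC install target from this thread and a hypothetical $SCRATCH environment variable defined on the remote hosts (adjust to your site):

```
# global.cylc

[install]
    [[symlink dirs]]
        # one section per install target, not per platform
        [[[HPC]]]
            # hypothetical location: the run directory is created under
            # $SCRATCH/cylc-run/ and symlinked from ~/cylc-run/
            run = $SCRATCH
```

Every platform that declares install target = HPC would pick up the same setting, so it only needs to be written once.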

Additionally, something I missed from my previous response:

  • The background job runner is local-only (e.g. you can’t submit a background job on one host and kill it on another).
  • A platform group represents a collection of platforms.
  • Cylc will randomly choose a platform (e.g. to submit a job to). If all hosts for that platform are down, Cylc will pick another platform from the group.

At the Met Office, we use a platform configuration along the lines of this for our HPCs:

# global.cylc

[platforms]
    [[HPC]]
        # submit jobs to SLURM
        job runner = slurm

        # pick one of these hosts to perform operations on
        hosts = login1, login2

        # name the filesystem so that Cylc
        # knows which platforms share filesystems
        install target = HPC

    [[HPC-login1-bg]]
        # submit jobs locally on the node
        job runner = background

        # run all operations on login1
        hosts = login1

        # this platform shares the same filesystem as the others
        install target = HPC

    [[HPC-login2-bg]]
        job runner = background
        hosts = login2
        install target = HPC

[platform groups]
    [[HPC-bg]]
        # pick one of these platforms to run operations on
        platforms = HPC-login1-bg, HPC-login2-bg

Workflows use these platforms like so:

[runtime]
    [[model]]
        # submit to slurm
        platform = HPC
        [[[directives]]]
            --time=10
            --ntasks=20
            --mem=10G

    [[housekeep]]
        # run on one of the login nodes (I don't mind which one)
        platform = HPC-bg

    [[special-task]]
        # run on login1 (not something you usually need to do)
        platform = HPC-login1-bg

There’s a similar example in the platform configuration reference docs here: Platform Configuration — Cylc 8.3.4 documentation

This is especially good to know, as we currently use the generic “belenos” host, which lets the HPC assign the login node on the fly. As you point out, this will cause issues for background jobs if we want to stop or kill them, so it is best for us to specify a platform group as in your case.