Question re global.cylc[platforms][<platform name>]hosts

I am trying to come up with the best way to describe the environment of ECMWF’s HPC center.
I have this:

    [[ecmwf_login]]
        hosts = hpc-login
        job runner = background
        communication method = poll
        retrieve job logs = True
        install target = hpc-login

    [[ecmwf_compute]]
        hosts = hpc-login
        job runner = slurm
        retrieve job logs = True
        communication method = poll
        install target = hpc-login

But it’s not quite right. “hpc-login” is what we ssh to, but the actual hosts are named a[a-z]\d-\d{3,}, and the compute nodes follow a similar pattern. This causes issues with jobs running in the background on ecmwf_login. I’ve solved the same issue on our local MeteoFrance HPC:

    [[belenos-bg1]]
        hosts = belenoslogin1
        cylc path = /home/ext/mr/smer/turekg/miniconda3/envs/cylc/bin
        job runner = background
        retrieve job logs = True
        install target = belenos
        execution polling intervals = PT1M

    [[belenos-bg2]]
        hosts = belenoslogin2
        cylc path = /home/ext/mr/smer/turekg/miniconda3/envs/cylc/bin
        job runner = background
        retrieve job logs = True
        install target = belenos
        execution polling intervals = PT1M

    [[belenos-bg3]]
        hosts = belenoslogin3
        cylc path = /home/ext/mr/smer/turekg/miniconda3/envs/cylc/bin
        job runner = background
        retrieve job logs = True
        install target = belenos
        execution polling intervals = PT1M

[platform groups]

    [[belenos_login]]
        # pick one of these platforms to run operations on
        platforms = belenos-bg1, belenos-bg2, belenos-bg3
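
A task in flow.cylc then just points at the group, and Cylc picks one of the listed platforms at job submission time (the task name here is just illustrative):

    [runtime]
        [[prep_task]]
            # Cylc selects one of belenos-bg1/2/3 from the group
            platform = belenos_login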

So if I designate a platform [[ a[a-z]\d-\d{3,} ]], what should hosts = be?
And then what would platforms = be for a platform group named [[ecmwf_login]]?
Thanx!

Hi @gturek - I’ll leave this one to the UK team, as they might have specific knowledge of ECMWF platforms. Generally though, the background job runner does need a specific single host, because that’s the only place Cylc can “see” the job (but you can have multiple background platform definitions for multiple hosts), unlike for batch systems like Slurm. It might help to say where you’ll run your Cylc schedulers - is that on the login nodes?

Hi Hilary, what we’re ultimately trying to do is to be able to run jobs from outside ECMWF - from Mercator itself at first, but the ultimate goal would be to run jobs from the digital platform EDITO https://www.edito.eu. So the schedulers are remote. There is another can of worms to be dealt with, which is the way connections to ECMWF are set up (via 2FA and Teleport). So far we’ve been able to run there with mixed results: we cannot run simple tasks in the background (we lose “connectivity”), and we have issues recovering comms if the 2FA validity window expires (12 hrs).
We did this for one particular project, so we concentrated on just getting it to work, but I want to do more testing to find out what the correct Cylc configuration(s) would be, etc.
Who is the best person in the UK team to talk to?

@davidmatthews manages the deployment on ECMWF.

We set up Cylc on ECMWF - and you should be able to use that setup as a template - have a look at

cylc config --platforms

to see what we have setup.

General comments about platform setup

hosts refers to the hostname(s) of the server from which you submit your job. Slurm can submit to other hosts from there.

job runner = background will only allow your task to run on the host listed under hosts. Your [[ecmwf_compute]] looks correct. You should find that adding echo $HOSTNAME to your job scripts shows that the job is being sent by Slurm to another host.
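
To check this, a minimal flow.cylc sketch (the task name is illustrative) might be:

    [scheduling]
        [[graph]]
            R1 = show_host
    [runtime]
        [[show_host]]
            platform = ecmwf_compute
            # prints the compute node Slurm allocated, not hpc-login
            script = echo $HOSTNAME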

[[ecmwf_login]] appears to be trying to run tasks as background jobs on the HPC login node. Depending on how big your jobs are relative to the login node’s capacity, this may cause the system admins to tell you to stop and/or kill your jobs. I wonder if you might want something more like:

    [[ecs]]
        hosts = ecs-batch
        job runner = slurm

… where again, Slurm farms the tasks out.

Hi Tim, not sure I quite understand. How do I get to Cylc on ECMWF?
Login node jobs would be really inconsequential, such as keyword substitution in namelists and the like. I have dropped an email to Dave. Thanx!

Note: job runners and login/compute nodes

With batch systems such as PBS and Slurm we can submit jobs in one place (the login node) and have them run in another (the compute node).

However, with the “background” job runner, the login node and the compute node must be the same. This is because Cylc needs to know the host the job is running on in order to poll and kill the job as necessary.
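
As a minimal sketch of that rule (platform and host names here are hypothetical):

    [platforms]
        # background: hosts must be the single node the job runs on,
        # because Cylc polls/kills the job there directly
        [[login1-bg]]
            hosts = login1
            job runner = background

        # slurm: submit from the login node; the batch system places
        # the job on a compute node and Cylc tracks it via Slurm
        [[hpc-slurm]]
            hosts = login1
            job runner = slurm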

The Met Office has already installed Cylc for use on ECMWF.

Users with Met Office SRS access can view the instructions here: https://code.metoffice.gov.uk/trac/scpn/wiki/ecmwf/cylc

Summary:

  • Set up ssh access to ECMWF via Teleport.
  • Add PATH="$PATH:~frmi/bin" to your .profile file to access the Cylc/Rose installations.
  • MO users run their workflows on MO systems, submitting the jobs to ECMWF.

I don’t have access to this platform, so can’t comment further. Dave is on leave at the moment, but can provide more information next week.

Here is an example configuration:

# background jobs
[[ecs-bg]]
    ssh command = ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no -oControlPath=~/.ssh/master-%l-%r@%h:%p
    copyable environment variables = FCM_VERSION
    submission polling intervals = PT1M
    execution polling intervals = PT1M
    execution time limit polling intervals = PT1M
    clean job submission environment = True
    retrieve job logs = True
    retrieve job logs max size = 32M
    retrieve job logs retry delays = PT10S, PT30S, PT3M
    install target = ec
    communication method = poll
    retrieve job logs command = rsync -a --no-p --no-g --chmod=ugo=rwX
    hosts = ecs-batch
    [[[meta]]]
        description = ECMWF ECS login node background job

# slurm jobs
[[ecs]]
    ssh command = ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no -oControlPath=~/.ssh/master-%l-%r@%h:%p
    copyable environment variables = FCM_VERSION
    submission polling intervals = PT1M
    execution polling intervals = PT1M
    execution time limit polling intervals = PT1M
    clean job submission environment = False
    retrieve job logs = True
    retrieve job logs max size = 32M
    retrieve job logs retry delays = PT10S, PT30S, PT3M
    install target = ec
    communication method = poll
    retrieve job logs command = rsync -a --no-p --no-g --chmod=ugo=rwX
    hosts = ecs-batch
    job runner = slurm
    [[[meta]]]
        description = ECMWF ECS Slurm job

Thanx Oliver, that is very helpful.