But it’s not quite right. “hpc-login” is what we ssh to, but the actual hosts are named a[a-z]\d-\d{3,}, and the compute nodes follow a similar pattern. This causes issues with jobs running in the background on ecmwf_login. I’ve solved the same issue on our local MeteoFrance HPC:
So if I designate a platform [[ a[a-z]\d-\d{3,} ]], what should hosts = be?
And then what would platforms = be for a platform group named [[ecmwf_login]]?
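For reference, the shape of configuration I’m asking about is roughly this (a global.cylc sketch; the actual values are exactly what I’m unsure of):

```
# global.cylc -- sketch of the question, values deliberately left open
[platforms]
    [[a[a-z]\d-\d{3,}]]
        job runner = background
        hosts = ...        # <- what goes here?

[platform groups]
    [[ecmwf_login]]
        platforms = ...    # <- and here?
```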
Thanx!
Hi @gturek - I’ll leave this one to the UK team, as they might have specific knowledge of ECMWF platforms. Generally though, the background job runner does need a single, specific host, because that’s the only place Cylc can “see” the job (although you can have multiple background platform definitions for multiple hosts), unlike for batch systems such as Slurm. It might help to say where you’ll run your Cylc schedulers - is that on the login nodes?
Hi Hilary, what we’re ultimately trying to do is run jobs from outside ECMWF - from Mercator itself at first, but the ultimate goal is to run jobs from the digital platform EDITO https://www.edito.eu. So the schedulers are remote. There is another can of worms to be dealt with, which is the way connections to ECMWF are set up (via 2FA and Teleport). So far we’ve been able to run there with mixed results: we cannot run simple tasks in the background (we lose “connectivity”), and we have issues recovering comms if the 2FA validity window (12 hrs) expires.
We did this for one particular project, so we concentrated on just getting it to work, but I want to do more testing to find out what the correct Cylc configuration(s) would be.
Who is the best person in the UK team to talk to?
We set up Cylc on ECMWF, and you should be able to use that setup as a template. Have a look at
cylc config --platforms
to see what we have set up.
General comments about platform setup
hosts refers to the hostname(s) of the server from which you submit your job; Slurm can then submit to other hosts from there.
job runner = background will only allow your task to run on a host listed under hosts. Your [[ecmwf_compute]] looks correct. You should find that adding echo $HOSTNAME to your job scripts shows that the job is being sent by Slurm to another host.
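For example, a quick check along these lines (the task name is illustrative) should print a compute-node hostname rather than the login node you submitted from:

```
# flow.cylc fragment -- illustrative task for checking where jobs land
[runtime]
    [[check_host]]
        platform = ecmwf_compute   # a Slurm platform, as in your setup
        script = echo "Running on: $HOSTNAME"
```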
[[ecmwf_login]] appears to be trying to run tasks as background jobs on the HPC login node. Depending on how big your jobs are relative to the login node’s capacity, this may cause the system admins to tell you to stop and/or kill your jobs. I wonder if you might want something more like:
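A minimal sketch of the idea, assuming a background platform pinned to one specific, resolvable login host (the hostname here is hypothetical):

```
# global.cylc fragment -- hostname is hypothetical
[platforms]
    [[ecmwf_login]]
        job runner = background
        # a single, specific host, so Cylc knows where to poll/kill jobs
        hosts = login-host-1
```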
Hi Tim, not sure I quite understand. How do I get to cylc on ECMWF?
Login node jobs would be really inconsequential - things such as keyword substitution in namelists and the like. I have dropped an email to Dave. Thanx!
With batch systems such as PBS and Slurm we can submit jobs in one place (the login node) and have them run in another (the compute node).
However, with the “background” job runner, the login node and the compute node must be the same. This is because Cylc needs to know the host the job is running on in order to poll and kill the job as necessary.
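To illustrate the contrast, a minimal global.cylc sketch (platform and host names are hypothetical):

```
# global.cylc -- illustrative; names are hypothetical
[platforms]
    [[hpc_slurm]]
        # submit from the login node; Slurm places the job on a compute node
        job runner = slurm
        hosts = login-node-1

    [[hpc_background]]
        # the job runs directly on the listed host, so Cylc must know
        # exactly which host that is, in order to poll and kill the job
        job runner = background
        hosts = login-node-1
```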