Hi there everyone.
We are new to Cylc and are trying to work out how to submit tasks as Slurm jobs, but we are running into an issue when using the --nodelist Slurm directive. Things work for Slurm jobs that happen to run on the same host where we run cylc install, but not when we request a different node.
Our platform config:
[platforms]
    [[kapua]]
        hosts = localhost
        job runner = slurm
        install target = localhost
[install]
    [[symlink dirs]]
        [[[localhost]]]
            run = /hub/scratch
Rationale and context: This is an experimental cluster with kap-xeon02 configured as the Slurm controller, and kap-xeon02 and kap-epyctest as Slurm nodes. Cylc and Cylchub are installed on kap-xeon02, and we install Cylc workflows directly there, so we are using kap-xeon02 as a “login node”. The reason for using symlink dirs is that these are not headless nodes, so home directories are not on a shared filesystem, but /hub/scratch is.
The workflow file is:
[scheduler]
    UTC mode = True
[scheduling]
    initial cycle point = now
    [[xtriggers]]
        delay3h = wall_clock(offset=PT3H)
        delay8h = wall_clock(offset=PT8H)
    [[graph]]
        # Repeat every 12 h starting at the initial cycle point.
        PT12H = """
            @delay8h => roms_saus2d_pre => roms_saus2d_run => roms_saus2d_post
        """
        # Repeat every 12 h starting 12 h after the initial cycle point.
        +PT12H/PT12H = """
            @delay8h => roms_saus2d_post[-PT12H] => roms_saus2d_pre => roms_saus2d_run => roms_saus2d_post
        """
[runtime]
    [[roms_saus2d_pre]]
        script = roms_preproc.sh
        [[[environment]]]
            MODEL_CONFIG = model.roms_gfs_saus2d.yml
    [[roms_saus2d_run]]
        script = roms_run.sh
        platform = kapua
        [[[environment]]]
            ROMSBIN = ROMS2D_NO-TIDE_NO-OBC
            SCRATCH_DIR = /hub/scratch/forecast/roms/gfs_saus2d
            SLURM_TASKS = 12
        [[[directives]]]
            --mem = 8G
            --ntasks = 12
            --cpus-per-task = 1
            --propagate = STACK
            --partition = mainQ
            --output = /hub/scratch/forecast/roms/gfs_saus2d/roms.out
            --error = /hub/scratch/forecast/roms/gfs_saus2d/roms.err
            --nodelist = kap-epyctest
    [[roms_saus2d_post]]
        script = roms_postproc.sh
        [[[environment]]]
            MODEL_CONFIG = model.roms_gfs_saus2d.yml
Everything runs fine without the --nodelist = kap-epyctest directive: the Slurm job runs on kap-xeon02, which has enough resources, and the workflow succeeds. But when we add --nodelist = kap-epyctest, the job fails with this error:
cat /hub/scratch/forecast/roms/gfs_saus2d/roms.err
/var/lib/slurm-llnl/slurmd/job00311/slurm_script: line 65: /home/metocean/cylc-run/ops-tiny/run1/.service/etc/job.sh: No such file or directory
/var/lib/slurm-llnl/slurmd/job00311/slurm_script: line 66: cylc__job__main: command not found
I expected the symlink dirs setting to fix the missing path, but it didn’t work. We have the global.cylc shown above at /etc/cylc/flow/global.cylc on both servers.
Is this the correct way to configure the platform? Or does Cylc expect home directories to be on a shared filesystem?
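In case it helps with diagnosis, here is a minimal sketch of a check that could be run on each node to see whether the job script path from the error exists there and where it resolves. The check_path helper is hypothetical (it is not part of Cylc or Slurm), and readlink -f assumes GNU coreutils.

```shell
# Hypothetical helper: report whether a path exists on the current host
# and, if it does, where it resolves after following symlinks.
check_path() {
    path="$1"
    if [ -e "$path" ]; then
        # readlink -f (GNU coreutils) prints the fully resolved target.
        echo "$(hostname): OK $path -> $(readlink -f "$path")"
    else
        echo "$(hostname): MISSING $path"
        return 1
    fi
}
```

On kap-xeon02 this could be called as check_path /home/metocean/cylc-run/ops-tiny/run1/.service/etc/job.sh. A comparable check on the other node might be something like srun --partition=mainQ --nodelist=kap-epyctest ls -l /home/metocean/cylc-run/ops-tiny/run1, which should show whether the run directory (or a symlink to it under /hub/scratch) is visible from kap-epyctest at all.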
Thanks in advance!