Hello, I’m finally getting round to migrating to Cylc 8, but so far I haven’t been able to get even the simplest tasks working.
The documentation is rather sparse on troubleshooting and also makes it hard to figure out simple things like how to set up a platform configuration. So I resorted to guesswork and put this in my ~/.cylc/flow/global.cylc:
[platforms]
    [[cluster-slurm]]
        hosts = metoc-cl4
        job runner = slurm
        retrieve job logs = True
        install target = localhost
    [[cluster-background]]
        hosts = metoc-cl4
        job runner = background
        retrieve job logs = True
        install target = localhost
I basically have two types of tasks: those that use Slurm (to run on the compute nodes) and tiny shell scripts that run on the headnode.
I can’t even get to the point where a task is executed, let alone submitted using Slurm.
With some guesswork I pieced together the installation (the documentation is badly lacking here): I installed a cylc wrapper script using cylc get-resources cylc /usr/local/bin/ and then symlinked /usr/local/bin/rose to point to the cylc script. (That bit is missing from the documentation.) So now it finds the cylc and rose commands when it runs on the headnode.
I ran cylc vip suitename and inspected the log directory on the cluster headnode. The documentation says I should be inspecting the “workflow log”, but there’s no such thing (I guess it’s been renamed, and the docs haven’t kept up to point out where to find it).
The job.err for a submit-failed job (using platform cluster-background) says:
/home/model/cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/build_uscc_engine/02/job: line 89: /home/model/cylc-run/downloader_suite_cl4/run8/.service/etc/job.sh: No such file or directory
/home/model/cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/build_uscc_engine/02/job: line 90: cylc__job__main: command not found
This is odd: it complains about not finding something I have no control over.
BTW, I did get past this point previously but forgot what I did, so I cannot replicate it.
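To double-check that the file really is absent on the headnode (rather than, say, unreadable), something like the following could be run there. This is only a sketch: the function name check_run_dir_files is made up for illustration, and the only path it checks is the one named in the error above.

```shell
# Hypothetical helper: report whether the job.sh named in the job.err
# message exists under a given workflow run directory.
check_run_dir_files() {
    crd_run_dir="$1"
    crd_rel=".service/etc/job.sh"
    if [ -f "$crd_run_dir/$crd_rel" ]; then
        echo "found: $crd_rel"
    else
        echo "missing: $crd_rel"
        return 1
    fi
}
```

For the run above this would be called as check_run_dir_files "$HOME/cylc-run/downloader_suite_cl4/run8".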
The log directory for a job using platform cluster-slurm contains only the job and job.status files. The job.status of this job simply reads:
CYLC_JOB_RUNNER_NAME=slurm
and no other log files are created.
I then inspected the cylc cat-log output on the Cylc host and saw this:
2023-08-24T21:05:36Z INFO - Scheduler: url=tcp://cylcsrvr2:43015 pid=2900868
2023-08-24T21:05:36Z INFO - Workflow publisher: url=tcp://cylcsrvr2:43007
2023-08-24T21:05:36Z INFO - Run: (re)start number=1, log rollover=3
2023-08-24T21:05:36Z INFO - Cylc version: 8.2.0
2023-08-24T21:05:36Z INFO - Run mode: live
2023-08-24T21:05:36Z INFO - Initial point: 20230818T0000Z
2023-08-24T21:05:36Z INFO - Final point: 30000101T0000Z
2023-08-24T21:11:27Z CRITICAL - [20230822T0000Z/get_cmems_medwam_analysis_00 preparing job:10 flows:1] submission failed
2023-08-24T21:11:27Z INFO - [20230822T0000Z/get_cmems_medwam_analysis_00 preparing job:10 flows:1] => waiting
2023-08-24T21:11:27Z WARNING - [20230822T0000Z/get_cmems_medwam_analysis_00 waiting job:10 flows:1] retrying in PT30S (after 2023-08-24T21:11:57Z)
2023-08-24T21:11:27Z ERROR - [jobs-submit cmd] ssh -oBatchMode=yes -oConnectTimeout=10 metoc-cl4 env CYLC_VERSION=8.2.0 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-submit --debug --utc-mode --remote-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/downloader_suite_cl4/run8/log/job' 20230822T0000Z/get_cmems_medwam_analysis_00/10 20230822T0000Z/get_cmems_medwam_forecast_00/10
[jobs-submit ret_code] 1
[jobs-submit out] 2023-08-24T21:11:27Z|20230822T0000Z/get_cmems_medwam_forecast_00/10|1
2023-08-24T21:11:27Z DEBUG - [20230822T0000Z/get_cmems_medwam_forecast_00 preparing job:10 flows:1] (internal)submission failed at 2023-08-24T21:11:27Z
2023-08-24T21:11:27Z CRITICAL - [20230822T0000Z/get_cmems_medwam_forecast_00 preparing job:10 flows:1] submission failed
I don’t understand what’s going on.
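One thing I can at least rule out is the cylc wrapper not being present in any of the directories passed via the --path options in the log above. A rough sketch of a check to run on metoc-cl4 follows; the function name find_exe_in_dirs is made up for illustration and is not part of Cylc.

```shell
# Hypothetical helper: print every DIR/NAME that is an executable regular file.
# Usage: find_exe_in_dirs NAME DIR...
# Returns 0 if at least one match was found, 1 otherwise.
find_exe_in_dirs() {
    fed_name="$1"
    shift
    fed_rc=1
    for fed_dir in "$@"; do
        if [ -f "$fed_dir/$fed_name" ] && [ -x "$fed_dir/$fed_name" ]; then
            echo "$fed_dir/$fed_name"
            fed_rc=0
        fi
    done
    return "$fed_rc"
}
```

With the directories from the log this would be find_exe_in_dirs cylc /bin /usr/bin /usr/local/bin /sbin /usr/sbin /usr/local/sbin. It only confirms that a file is there and executable, not that the wrapper itself works.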
Maybe someone can suggest a step-by-step procedure to debug this. I can’t see much that’s useful in the logs, so I’m really stuck.
Any help is much appreciated.
Thanks,
Fred