Hello, I’m finally getting round to migrating to Cylc 8, but so far I haven’t been able to get even the simplest tasks working.
The documentation is rather sparse on troubleshooting and also makes it hard to figure out simple things like how to set up a platform configuration. So I resorted to guesswork and put this in my configuration:
[platforms]
    [[cluster-slurm]]
        hosts = metoc-cl4
        job runner = slurm
        retrieve job logs = True
        install target = localhost
    [[cluster-background]]
        hosts = metoc-cl4
        job runner = background
        retrieve job logs = True
        install target = localhost
I basically have two types of tasks, those that use slurm (to run on the compute nodes) and those tiny shell scripts that run on the headnode.
I can’t get even to the point where the task is executed, let alone submitted using slurm.
With some guesswork I pieced together the installation (the documentation is badly lacking here): I installed a cylc wrapper script using cylc get-resources cylc /usr/local/bin/ and then linked /usr/local/bin/rose to point to the cylc script. (That bit is missing from the documentation.) So now the cylc and rose commands are found on the headnode.
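To make the two installation steps unambiguous, here is roughly what I did, written as a small shell function. The function name install_cylc_wrapper and its directory argument are just my own illustration (I used /usr/local/bin directly), and I am assuming a symbolic link is an acceptable way to make rose point at the wrapper.

```shell
# Hypothetical helper, not part of Cylc: install the cylc wrapper script
# into a bin directory and make `rose` point at the same wrapper.
install_cylc_wrapper() {
    bin_dir="${1:-/usr/local/bin}"

    # Extract the wrapper script shipped with Cylc into the bin directory.
    cylc get-resources cylc "$bin_dir/" || return 1

    # Link rose to the wrapper so both commands resolve to the same script.
    ln -sf "$bin_dir/cylc" "$bin_dir/rose"
}
```

On the headnode this would be called as install_cylc_wrapper /usr/local/bin, with whatever privileges are needed to write there.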
I then ran cylc vip suitename and inspected the log directory on the cluster headnode. The documentation says I should be inspecting the “workflow log”, but there’s no such thing (I guess it’s been renamed, and the docs haven’t kept up to point out where to find it).
The job.err for a submit-failed job (using platform cluster-background) says:
/home/model/cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/build_uscc_engine/02/job: line 89: /home/model/cylc-run/downloader_suite_cl4/run8/.service/etc/job.sh: No such file or directory
/home/model/cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/build_uscc_engine/02/job: line 90: cylc__job__main: command not found
This is odd: it complains about not finding something I have no control over.
BTW, I had it past this point previously but forgot what I did, so cannot replicate it.
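In case it helps anyone reproduce this, a check along the following lines should confirm whether the file named in the job.err message really is absent from the run directory on the headnode. The function name check_job_sh is my own invention, and the only thing it relies on is the .service/etc/job.sh path taken from the error message.

```shell
# Hypothetical helper, not part of Cylc: report whether the job.sh file
# named in the job.err message exists under a given run directory.
check_job_sh() {
    run_dir="$1"
    job_sh="$run_dir/.service/etc/job.sh"

    if [ -f "$job_sh" ]; then
        echo "found: $job_sh"
    else
        echo "missing: $job_sh"
        return 1
    fi
}
```

For the run above I would call it on the headnode as check_job_sh "$HOME/cylc-run/downloader_suite_cl4/run8".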
The log directory for a job using platform cluster-slurm contains only the job and job.status files.
job.status of this job simply reads:
and no other log files are created.
I then inspected the cylc cat-log output on the cylc host and saw this:
2023-08-24T21:05:36Z INFO - Scheduler: url=tcp://cylcsrvr2:43015 pid=2900868
2023-08-24T21:05:36Z INFO - Workflow publisher: url=tcp://cylcsrvr2:43007
2023-08-24T21:05:36Z INFO - Run: (re)start number=1, log rollover=3
2023-08-24T21:05:36Z INFO - Cylc version: 8.2.0
2023-08-24T21:05:36Z INFO - Run mode: live
2023-08-24T21:05:36Z INFO - Initial point: 20230818T0000Z
2023-08-24T21:05:36Z INFO - Final point: 30000101T0000Z
2023-08-24T21:11:27Z CRITICAL - [20230822T0000Z/get_cmems_medwam_analysis_00 preparing job:10 flows:1] submission failed
2023-08-24T21:11:27Z INFO - [20230822T0000Z/get_cmems_medwam_analysis_00 preparing job:10 flows:1] => waiting
2023-08-24T21:11:27Z WARNING - [20230822T0000Z/get_cmems_medwam_analysis_00 waiting job:10 flows:1] retrying in PT30S (after 2023-08-24T21:11:57Z)
2023-08-24T21:11:27Z ERROR - [jobs-submit cmd] ssh -oBatchMode=yes -oConnectTimeout=10 metoc-cl4 env CYLC_VERSION=8.2.0 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-submit --debug --utc-mode --remote-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/downloader_suite_cl4/run8/log/job' 20230822T0000Z/get_cmems_medwam_analysis_00/10 20230822T0000Z/get_cmems_medwam_forecast_00/10
    [jobs-submit ret_code] 1
    [jobs-submit out] 2023-08-24T21:11:27Z|20230822T0000Z/get_cmems_medwam_forecast_00/10|1
2023-08-24T21:11:27Z DEBUG - [20230822T0000Z/get_cmems_medwam_forecast_00 preparing job:10 flows:1] (internal)submission failed at 2023-08-24T21:11:27Z
2023-08-24T21:11:27Z CRITICAL - [20230822T0000Z/get_cmems_medwam_forecast_00 preparing job:10 flows:1] submission failed
I don’t understand what’s going on.
Maybe someone can suggest a step-by-step procedure to debug this. I can’t see much that’s useful in the logs, so I’m really stuck.
Any help is much appreciated.