Using Cylc 8.3.6 (also seen in 8.3.5; I haven’t checked other versions).
I launched a workflow, then tried to run a cylc set command, but it kept telling me the workflow was stopped.
$ cylc vip --pause $PWD
$ cylc validate /home/code/retry
Valid for cylc-8.3.6
$ cylc install /home/code/retry
INSTALLED retry/run4 from /home/retry
$ cylc play --pause retry/run4
▪ ■ Cylc Workflow Engine 8.3.6
██ Copyright (C) 2008-2024 NIWA
▝▘ & British Crown (Met Office) & Contributors
INFO - Extracting job.sh to /home/cylc-run/retry/run4/.service/etc/job.sh
retry/run4: dev-cylc-003.hpc.internal.bom.gov.au PID=348337
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
Command queued
I retried very frequently, but it seemed to take about 30 seconds before cylc set could be actioned. If I hadn’t paused the workflow, tasks would have run without issue in the meantime (as they did in other runs where I tried this).
2024-11-29T04:37:30Z INFO - Workflow: retry/run4
2024-11-29T04:37:30Z INFO - Scheduler: url=tcp://hostname:43032 pid=348337
2024-11-29T04:37:30Z INFO - Workflow publisher: url=tcp://hostname:43213
2024-11-29T04:37:30Z INFO - Run: (re)start number=1, log rollover=1
2024-11-29T04:37:30Z INFO - Cylc version: 8.3.6
2024-11-29T04:37:30Z INFO - Run mode: live
2024-11-29T04:37:30Z INFO - Initial point: 20201010T0000Z
2024-11-29T04:37:30Z INFO - Final point: 20201010T0000Z
2024-11-29T04:37:30Z INFO - Cold start from 20201010T0000Z
2024-11-29T04:37:30Z INFO - New flow: 1 (original flow from 20201010T0000Z) 2024-11-29T04:37:30+00:00
2024-11-29T04:37:30Z INFO - [20201010T0000Z/foo:waiting(runahead)] => waiting
2024-11-29T04:37:30Z INFO - [20201010T0000Z/foo:waiting] => waiting(queued)
2024-11-29T04:37:31Z INFO - Pausing the workflow: Paused on start up
2024-11-29T04:37:59Z INFO - Command "set" received. ID=87048f0d-1b61-4125-901c-655472530bec
set(flow=[], flow_wait=False, outputs=[], prerequisites=[], tasks=['20201010T0000Z/foo'])
2024-11-29T04:38:00Z INFO - [20201010T0000Z/foo:waiting(queued)] setting implied output: submitted
2024-11-29T04:38:00Z INFO - [20201010T0000Z/foo:waiting(queued)] setting implied output: started
2024-11-29T04:38:00Z INFO - [20201010T0000Z/foo:waiting(queued)] => failed(queued)
2024-11-29T04:38:00Z INFO - [20201010T0000Z/bar:waiting(runahead)] => waiting
2024-11-29T04:38:00Z INFO - [20201010T0000Z/bar:waiting] => waiting(queued)
2024-11-29T04:38:00Z INFO - [20201010T0000Z/foo/00:failed(queued)] => failed
2024-11-29T04:38:00Z INFO - Command "set" actioned. ID=87048f0d-1b61-4125-901c-655472530bec
I can reproduce this. It applies to lots of commands, e.g. cylc ping. However, in my tests, if I run cylc scan first, it detects the workflow immediately and all other commands then work. So something is clearly not right.
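As a stopgap, the observation above can be wrapped in a small helper. This is only a sketch of the workaround described here, not a fix: scan_then_run is a hypothetical name, and it assumes that running cylc scan first is enough to make the workflow visible to the next command, as it was in my tests.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Cylc): run `cylc scan` once, then the
# requested cylc subcommand.
# ASSUMPTION: `cylc scan` makes the running workflow visible to later
# commands, as observed in this thread.
scan_then_run() {
    # The scan output is not needed here; ignore failures so that the real
    # command still runs and reports its own error.
    cylc scan > /dev/null 2>&1 || true
    cylc "$@"
}
```

For example: scan_then_run set retry/run4 //20201010T0000Z/foo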
Note that this only happens when using a separate scheduler host. If you run with --host=localhost it doesn’t happen, so I assume it is related to the network filesystem.
I think Dave’s right, this is something to do with the network filesystem. When I try to cat the contact file, it doesn’t exist, but after I run cylc scan it magically appears???
It turns out that this is actually something we already defend against (it rang a bell for a reason)!
However, the addition of the run name/number added a new level to the directory hierarchy. As a result, it would appear that we also need to list the parent directory (hence the two ls calls required in my above example rather than one).
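Roughly, the two listings amount to something like the sketch below. It is illustrative only: contact_file_visible is a hypothetical name, I am assuming the contact file sits at .service/contact under the run directory, and I am assuming that listing a directory is what prompts the NFS client to refresh its cached entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Cylc): list the parent of the run
# directory, then the run directory itself, then test for the contact file.
# ASSUMPTION: listing a directory refreshes the NFS client's cached entries.
# ASSUMPTION: the contact file lives at <run-dir>/.service/contact.
contact_file_visible() {
    local run_dir="$1"
    local parent_dir
    parent_dir="$(dirname "$run_dir")"
    # The extra level introduced by the run name/number.
    ls "$parent_dir" > /dev/null 2>&1
    # The run directory itself.
    ls "$run_dir" > /dev/null 2>&1
    # Succeeds only if the contact file can now be seen.
    [ -f "$run_dir/.service/contact" ]
}
```

For example: contact_file_visible "$HOME/cylc-run/retry/run4" && cylc set retry/run4 //20201010T0000Z/foo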
If you run the workflow with --no-run-name, the contact file appears straight away without the need for any directory listing.
Hierarchical installation (e.g. cylc vip -n a/b/c) does not appear to trigger this issue, so I think the cause of the issue is the runN symlink.