Cylc set takes a long time to become usable after a workflow starts?

Using Cylc 8.3.6 (but also in 8.3.5 at least, haven’t checked anything else).

I launched a workflow, then tried to do some cylc set operation, but it kept telling me the workflow was stopped.

$ cylc vip --pause $PWD
$ cylc validate /home/code/retry
Valid for cylc-8.3.6
$ cylc install /home/code/retry
INSTALLED retry/run4 from /home/retry
$ cylc play --pause retry/run4

 ▪ ■  Cylc Workflow Engine 8.3.6
 ██   Copyright (C) 2008-2024 NIWA
▝▘    & British Crown (Met Office) & Contributors

INFO - Extracting job.sh to /home/cylc-run/retry/run4/.service/etc/job.sh
retry/run4: dev-cylc-003.hpc.internal.bom.gov.au PID=348337
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
WorkflowStopped: retry/run4 is not running
$ cylc set retry/run4 //20201010T0000Z/foo
Command queued

I retried very frequently, but it seemed to take about 30 seconds before cylc set could be actioned. In the meantime, if I hadn’t paused the workflow, tasks could run without issue (as they had in other runs I had tried this with).

2024-11-29T04:37:30Z INFO - Workflow: retry/run4
2024-11-29T04:37:30Z INFO - Scheduler: url=tcp://hostname:43032 pid=348337
2024-11-29T04:37:30Z INFO - Workflow publisher: url=tcp://hostname:43213
2024-11-29T04:37:30Z INFO - Run: (re)start number=1, log rollover=1
2024-11-29T04:37:30Z INFO - Cylc version: 8.3.6
2024-11-29T04:37:30Z INFO - Run mode: live
2024-11-29T04:37:30Z INFO - Initial point: 20201010T0000Z
2024-11-29T04:37:30Z INFO - Final point: 20201010T0000Z
2024-11-29T04:37:30Z INFO - Cold start from 20201010T0000Z
2024-11-29T04:37:30Z INFO - New flow: 1 (original flow from 20201010T0000Z) 2024-11-29T04:37:30+00:00
2024-11-29T04:37:30Z INFO - [20201010T0000Z/foo:waiting(runahead)] => waiting
2024-11-29T04:37:30Z INFO - [20201010T0000Z/foo:waiting] => waiting(queued)
2024-11-29T04:37:31Z INFO - Pausing the workflow: Paused on start up
2024-11-29T04:37:59Z INFO - Command "set" received. ID=87048f0d-1b61-4125-901c-655472530bec
    set(flow=[], flow_wait=False, outputs=[], prerequisites=[], tasks=['20201010T0000Z/foo'])
2024-11-29T04:38:00Z INFO - [20201010T0000Z/foo:waiting(queued)] setting implied output: submitted
2024-11-29T04:38:00Z INFO - [20201010T0000Z/foo:waiting(queued)] setting implied output: started
2024-11-29T04:38:00Z INFO - [20201010T0000Z/foo:waiting(queued)] => failed(queued)
2024-11-29T04:38:00Z INFO - [20201010T0000Z/bar:waiting(runahead)] => waiting
2024-11-29T04:38:00Z INFO - [20201010T0000Z/bar:waiting] => waiting(queued)
2024-11-29T04:38:00Z INFO - [20201010T0000Z/foo/00:failed(queued)] => failed
2024-11-29T04:38:00Z INFO - Command "set" actioned. ID=87048f0d-1b61-4125-901c-655472530bec

Is the delay for set being able to work expected?

I can reproduce this. It applies to lots of commands, e.g. cylc ping. However, in my tests, if I run cylc scan first, that detects the workflow immediately and all other commands then work. So, something is clearly not right.

Note that this only happens when using a separate scheduler host. If you run with --host=localhost it doesn’t happen so I assume it is network filesystem related.
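Until the cause is fixed, one stopgap is to retry the command until the workflow is detected, rather than re-typing it by hand. The following is only a minimal sketch: retry_until_ok is a hypothetical helper (not part of Cylc), and it assumes the command exits non-zero while it reports WorkflowStopped.

```shell
# retry_until_ok ATTEMPTS DELAY COMMAND [ARGS...]
# Hypothetical helper: re-run COMMAND until it exits 0 or ATTEMPTS runs out.
# Assumption: the wrapped command exits non-zero while the workflow is
# reported as stopped.
retry_until_ok() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (not run here): allow up to 60 attempts, one second apart.
# retry_until_ok 60 1 cylc set retry/run4 //20201010T0000Z/foo
```

This only masks the delay; it does nothing about whatever is hiding the running workflow from the client.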

I’ve opened https://github.com/cylc/cylc-flow/issues/6504

Is the delay for set being able to work expected?

Some delay might be expected, but not this much.


I can also repeat this.

I think Dave’s right, this is something to do with the network filesystem. When I try to cat the file, it doesn’t exist, but after I run cylc scan it magically appears???

$ cylc vip . -n foo
...

 ▪ ■  Cylc Workflow Engine 8.4.0.dev
 ██   Copyright (C) 2008-2024 NIWA
▝▘    & British Crown (Met Office) & Contributors
...

$ cat ~/cylc-run/foo/runN/.service/contact
cat ... No such file or directory
$ cylc scan
foo/run1
$ cat ~/cylc-run/foo/runN/.service/contact
CYLC_API=5
...

Questions:

  1. Why is the contact file slow to sync on the network filesystem? Have we done anything to exacerbate this?
  2. Why does cylc scan make the file magically appear?

I’ve got an answer to question (2)!

The cylc scan command performs filesystem listings. If I run ls on the workflow and .service directories first, the file magically appears:

$ cylc vip -n foo
...

$ cat ~/cylc-run/foo/runN/.service/contact
cat ... No such file or directory

$ ls /home/users/oliver.sanders/cylc-run/generic/runN/
...

$ ls /home/users/oliver.sanders/cylc-run/generic/runN/.service
...

$ cat ~/cylc-run/foo/runN/.service/contact
CYLC_API=5
...

This actually rings a bell, I’m sure we’ve performed directory listings before to force synchronization for something.

It turns out that this is actually something we already defend against (it rang a bell for a reason)!

However, the addition of the run name/number added a new level to the directory hierarchy. As a result, it would appear that we also need to list the parent directory (hence the two ls calls in my example above, rather than one).
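The two manual ls calls can be wrapped up as a small check. This is a rough sketch of the workaround as observed in this thread, not how Cylc itself does it: contact_file_visible is a hypothetical helper, and the idea that listing a directory prompts the filesystem client to refresh its view is an assumption based on the behaviour seen above.

```shell
# contact_file_visible RUN_DIR
# Hypothetical helper: list the run directory and its .service directory
# (mirroring the two manual ls calls), then report whether the contact
# file can be seen.
contact_file_visible() {
    run_dir=$1
    # Output is discarded; the listings are only there to prompt the
    # filesystem client to look at each directory again (assumed effect).
    ls -a "$run_dir" >/dev/null 2>&1 || true
    ls -a "$run_dir/.service" >/dev/null 2>&1 || true
    [ -f "$run_dir/.service/contact" ]
}

# Example (not run here):
# contact_file_visible ~/cylc-run/foo/runN && echo "contact file present"
```

Both levels are listed because, in the example above, listing the .service directory alone was not reported to be enough once the run directory was added to the hierarchy.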

If you run the workflow with --no-run-name, the contact file appears straight away without the need for any directory listing.

Hierarchical installation (e.g. cylc vip -n a/b/c) does not appear to trigger this issue, so I think the cause of the issue is the runN symlink.

Opened an issue, should be an easy fix: "running workflows may appear stopped on network filesystems" (cylc/cylc-flow issue #6506).