Hi @ejhyer
If the workflow contact file (~/cylc-run/<workflow-id>/.service/contact) gets removed or somehow becomes inaccessible, the scheduler will notice and shut down, with the error you've seen, when the main loop "health check" plugin runs, which by default is every 10 minutes:
# global.cylc
[scheduler]
    [[main loop]]
        plugins = health check, reset bad hosts
        [[[health check]]]
            interval = PT10M
(The contact file is only needed for client commands to get the scheduler's contact details. If this happens, a workflow restart will work fine, as you seem to have found, so long as the workflow database and other files in the run directory are OK, because the new scheduler instance creates a new contact file at start-up.)
I’m not aware of any way that a Cylc scheduler would delete its own contact file, and I’ve not seen any other reports of this problem.
The fact that it happens intermittently to all of your workflows at the same time suggests maybe a filesystem problem that makes the contact files inaccessible?
(Although if so, the scheduler would presumably also lose the ability to write to its run database, which is in the same location, and that would cause an even faster shutdown.)
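If you want to test the filesystem theory, one option is to poll for the contact file independently of the scheduler and log when it goes missing, so the timestamps can be compared with the scheduler log and any filesystem events. The following is only a rough sketch: check_contact is a hypothetical helper (not a Cylc command), and it assumes the default run directory layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Cylc: report whether a workflow's
# contact file is present and readable from the current host.
# Assumes the default layout <run-dir>/.service/contact.
check_contact() {
    local run_dir="$1"                      # e.g. "$HOME/cylc-run/bug/run32"
    local contact="$run_dir/.service/contact"
    local now
    now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"    # UTC timestamp for the log line
    if [[ -r "$contact" ]]; then
        echo "$now OK $contact"
        return 0
    fi
    echo "$now MISSING $contact"
    return 1
}

# Example use (commented out so that sourcing this file has no side effects):
# while sleep 60; do check_contact "$HOME/cylc-run/bug/run32"; done >> contact-watch.log
```

Running this on the scheduler host and on any host where you run client commands might show whether the file vanishes everywhere at once or only from one host's point of view.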
Something else to consider might be that client commands, such as cylc scan -t rich, automatically delete orphaned contact files (i.e., contact files left behind if a scheduler dies without cleaning up after itself). It shouldn't be possible for a client command to wrongly conclude that a live scheduler is dead, but it might be worth considering. Maybe if the client runs on a host that can see the contact file and for some reason thinks it is running on the same host as the scheduler when it really isn't?
If this happens, the client will print a warning. You can try it by killing a scheduler with kill -9 PID then running cylc scan -t rich:
$ kill -9 20848 # scheduler PID
$ cylc scan # this just reads contact files
bug/run32 NIWA-1022450.niwa.local:43039 20848
$ cylc scan -t rich # this tries to contact the scheduler, and finds it's dead
Job State Key: ■ submitted ■ submit-failed ■ running ■ succeeded ■ failed
INFO - Removed contact file for bug/run32 (workflow no longer running).
WARNING - Workflow not running: bug/run32