.service/contact file keeps getting deleted

cylc 8.4.2

This problem is intermittent, but not sporadic: when it’s happening, it happens to every suite I attempt to run. The observable symptom is that the suite switches itself to stopped after a few seconds. If I play it again, it resumes and functions normally in all respects, but then it stops again.

The cause leaves a trace in the logs:

2025-05-09T23:07:31Z WARNING - failed to remove workflow contact file: WORKFLOW/run1/.service/contact
2025-05-09T23:07:31Z ERROR - [Errno 2] No such file or directory: 'WORKFLOW/run1/.service/contact'
 cylc.flow.main_loop.MainLoopPluginException: /<path>/<to>/<WORKFLOW>/run1/.service/contact: contact file corrupted/modified and may be left

As part of its normal procedure, Cylc unlinks this file, but it appears that some other process is causing problems. NOTE: there is a symlink in the path to my cylc-run directory; I set it up this way because of space limitations in $HOME on my system. If this is the problem, I can’t figure out why it happens only intermittently.

Hi @ejhyer

If the workflow contact file (~/cylc-run/<workflow-id>/.service/contact) gets removed or somehow becomes inaccessible, the scheduler will notice that and shut down - with the error you’ve seen - when the main loop “health check” plugin runs, which by default is every 10 minutes:

# global.cylc
[scheduler]
    [[main loop]]
        plugins = health check, reset bad hosts
        [[[health check]]]
            interval = PT10M
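
You can check the effective value on your system with cylc config, e.g.:

$ cylc config -i '[scheduler][main loop][health check]interval'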

(The contact file is only needed by client commands to get scheduler contact details, so if this happens a workflow restart will work fine - as you seem to have found - so long as the workflow database and other files in the run directory are OK: the new scheduler instance will create a new contact file at start-up.)

I’m not aware of any way that a Cylc scheduler would delete its own contact file, and I’ve not seen any other reports of this problem.

The fact that it happens intermittently to all of your workflows at the same time suggests maybe a filesystem problem that makes the contact files inaccessible?

(Although if so, the scheduler would presumably also lose the ability to write to its run database, which is in the same location, and that would cause an even faster shutdown.)

Something else to consider might be that client commands, such as cylc scan -t rich, automatically delete orphaned contact files (i.e., contact files left behind if a scheduler dies without cleaning up after itself). It shouldn’t be possible for a client command to wrongly conclude that a live scheduler is dead, but it might be worth considering. Maybe the client runs on a host that can see the contact file and for some reason thinks it is on the same host as the scheduler when it really isn’t??

If this happens, the client will print a warning. You can try it by killing a scheduler with kill -9 PID then running cylc scan -t rich:

$ kill -9 20848  # scheduler PID

$ cylc scan   # this just reads contact files
bug/run32 NIWA-1022450.niwa.local:43039 20848 

$ cylc scan -t rich  # this tries to contact the scheduler, and finds it's dead
Job State Key: ■ submitted ■ submit-failed ■ running ■ succeeded ■ failed
INFO - Removed contact file for bug/run32 (workflow no longer running).
WARNING - Workflow not running: bug/run32


Thanks for diving in @hilary.j.oliver. Here’s some more of what I am seeing. When I run cylc play (identical symptoms using the GUI or the command line), the contact file is created with these permissions:

-rw-r--r--. 1 ehyer 4742J592 654 May 12 05:09 contact
The file is created along with other files, including the client* files, in .service/. The contact file then immediately vanishes (I can ls it once, but not twice), and the suite accordingly stops, presumably via the “health check” mechanism you describe. The other files in .service/ disappear after the suite has been stopped for a few minutes.

Based on the file permissions, it’s either me or a superuser causing the file to vanish. Add in the fact that no one else reports this problem, and it’s pretty clearly me. When I do ps -Af | grep cylc I have only these processes:

ehyer    1756524       1  0 05:09 ?        00:00:01 <...>/envs/cylc8-ui/bin/python <...>/envs/cylc8-ui/bin/cylc play --color=never --mode live SP8_TEST2/run1
ehyer    1878194 3325313  0 05:17 pts/7    00:00:00 grep --color=auto cylc
ehyer    3348519 3345098  5 May11 pts/9    00:44:56 <...>/envs/cylc8-ui/bin/python <...>/envs/cylc8-ui/bin/cylc gui --port=19971

The only other processes I have running that I can see via ps are related to VSCode.

This system has very active administration, running all kinds of background processes on the login nodes, but I’m baffled by the symptom, which appears to be some sort of process zapping .service/contact and nothing else.
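
One thing I can try the next time this happens is a crude watch loop to timestamp the exact moment the file vanishes (path illustrative):

CONTACT=~/cylc-run/SP8_TEST2/run1/.service/contact  # adjust to the affected run
while [ -e "$CONTACT" ]; do sleep 0.2; done
date '+%FT%T contact file vanished'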

Any ideas for troubleshooting are welcome as always; I really appreciate what you do and I enjoy using Cylc!


I managed to cat .service/contact before it disappeared, in case there might be any clues inside the file:

CYLC_VERSION=8.4.2
CYLC_WORKFLOW_COMMAND=/p/home/ehyer/src/conda/miniforge3/envs/cylc8-ui/bin/python /p/home/ehyer/src/conda/miniforge3/envs/cylc8-ui/bin/cylc play --color=never --mode live SP8_TEST2/run1
CYLC_WORKFLOW_HOST=nautilus.navydsrc.hpc.mil
CYLC_WORKFLOW_ID=SP8_TEST2/run1
CYLC_WORKFLOW_OWNER=ehyer
CYLC_WORKFLOW_PID=2205339
CYLC_WORKFLOW_PORT=43014
CYLC_WORKFLOW_PUBLISH_PORT=43052
CYLC_WORKFLOW_RUN_DIR_ON_WORKFLOW_HOST=/p/home/ehyer/cylc-run/SP8_TEST2/run1
CYLC_WORKFLOW_UUID=62c3fd27-2e24-44a5-b730-6fdc3044f9b2
SCHEDULER_CYLC_PATH=None
SCHEDULER_SSH_COMMAND=ssh -oBatchMode=yes -oConnectTimeout=10
SCHEDULER_USE_LOGIN_SHELL=True

There are only two cases where Cylc would delete the contact file:

  1. When a workflow shuts down normally (we know it isn’t this).
  2. When a subsequent Cylc command detects that a scheduler has crashed/been killed (shouldn’t happen).

The second mechanism is activated when a Cylc command to the scheduler times out. In this eventuality, Cylc checks whether the process (CYLC_WORKFLOW_PID) is still running on the box (CYLC_WORKFLOW_HOST) with the same command (CYLC_WORKFLOW_COMMAND). Cylc is effectively performing this test to see if the scheduler is still running:

$ ssh $CYLC_WORKFLOW_HOST ps -o cmd -p $CYLC_WORKFLOW_PID | grep "$CYLC_WORKFLOW_COMMAND"

If this test determines that the process is no longer running, or has been replaced by another process (identified by a different command string), then Cylc concludes that the scheduler either encountered a critical failure, was killed (kill -9), or the box it was running on crashed. In this situation, Cylc deletes the contact file to allow the workflow to be restarted.
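
Spelled out against the contact file fields, the test is roughly this (a hand-rolled sketch, not the exact Cylc implementation; the path is illustrative):

# read the scheduler's host, PID and command string from the contact file
CONTACT=~/cylc-run/SP8_TEST2/run1/.service/contact
HOST=$(sed -n 's/^CYLC_WORKFLOW_HOST=//p' "$CONTACT")
PID=$(sed -n 's/^CYLC_WORKFLOW_PID=//p' "$CONTACT")
CMD=$(sed -n 's/^CYLC_WORKFLOW_COMMAND=//p' "$CONTACT")
# ask the scheduler host whether that PID is still running that command
if ssh -oBatchMode=yes "$HOST" ps -o cmd= -p "$PID" | grep -qF "$CMD"; then
    echo "scheduler still running"
else
    echo "scheduler not found: contact file would be treated as orphaned"
fi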

If this happens, the Cylc command that actioned it will log the following message:

Removed contact file for <workflow> (workflow no longer running).

This could result from a variety of Cylc commands, e.g.:

  • cylc scan - check the terminal the command was run in for this message.
  • The Cylc GUI - check ~/.cylc/uiserver/log/*.
  • Task communication - check ~/cylc-run/<workflow>/log/job/<cycle>/<task>/<job>/job.out

We haven’t had a report of this test going wrong in the past, but if you find this message somewhere it would strongly indicate that it has.
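
A quick way to hunt for that message across the likely logs in one go (a rough sketch; adjust the paths to your setup):

$ grep -rs 'Removed contact file' ~/.cylc/uiserver/log ~/cylc-run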

If not, then we’re left with esoteric filesystem issues which are a pain to prove / diagnose :frowning:

  1. I logged back in today, and the symptoms have stopped: my suites behave as expected today. I think I am on a different login node, but I have not kept careful notes on that, unfortunately.

  2. Here is the trace from the ~/.cylc/uiserver/log files while the problem was occurring:

2025-05-11T15:59:04 INFO     [data-store] connect_workflow('~ehyer/SP8_CYCLING_TEST/run8', <dict>)
2025-05-11T15:59:09 INFO     Removed contact file for SP8_CYCLING_TEST/run8 (workflow no longer running).
2025-05-11T15:59:09 ERROR    WorkflowStopped: SP8_CYCLING_TEST/run8 is not running
2025-05-11T15:59:09 INFO     failed to connect to ~ehyer/SP8_CYCLING_TEST/run8
2025-05-11T15:59:09 INFO     [data-store] disconnect_workflow('~ehyer/SP8_CYCLING_TEST/run8')
2025-05-11T15:59:44 INFO     [data-store] connect_workflow('~ehyer/SP8_CYCLING_TEST/run8', <dict>)
2025-05-11T15:59:49 INFO     Removed contact file for SP8_CYCLING_TEST/run8 (workflow no longer running).
2025-05-11T15:59:49 ERROR    WorkflowStopped: SP8_CYCLING_TEST/run8 is not running
2025-05-11T15:59:49 INFO     failed to connect to ~ehyer/SP8_CYCLING_TEST/run8
2025-05-11T15:59:49 INFO     [data-store] disconnect_workflow('~ehyer/SP8_CYCLING_TEST/run8')
2025-05-11T16:00:04 INFO     [data-store] connect_workflow('~ehyer/SP8_CYCLING_TEST/run8', <dict>)
2025-05-11T16:00:09 INFO     Removed contact file for SP8_CYCLING_TEST/run8 (workflow no longer running).
2025-05-11T16:00:09 ERROR    WorkflowStopped: SP8_CYCLING_TEST/run8 is not running
2025-05-11T16:00:09 INFO     failed to connect to ~ehyer/SP8_CYCLING_TEST/run8
2025-05-11T16:00:09 INFO     [data-store] disconnect_workflow('~ehyer/SP8_CYCLING_TEST/run8')

Three observations:

  1. This is a series of 4-line sequences. The time elapsed between disconnect_workflow and connect_workflow varies because these were punctuated by my interventions to cylc play, either in the GUI or from the command line. Those cylc play invocations do not show up in the log, which seems very strange (and also like I need my head checked, but let’s pursue this; if the result is mental health intervention for me, I will be in your debt even more).

  2. This log from today’s session does not have anything like this in it, just the stuff you expect:

2025-05-12T18:24:11 INFO     [data-store] register_workflow('~ehyer/SP8_TEST4/run1', False)
2025-05-12T18:24:31 INFO     [data-store] connect_workflow('~ehyer/SP8_TEST4/run1', <dict>)
2025-05-12T18:27:09 INFO     [data-store] disconnect_workflow('~ehyer/SP8_TEST3/run1')
2025-05-12T18:34:01 INFO     [data-store] disconnect_workflow('~ehyer/SP8_TEST4/run1')
2025-05-12T18:34:06 INFO     $ cylc play --color=never --mode live SP8_TEST4/run1
2025-05-12T18:34:09 INFO     [data-store] connect_workflow('~ehyer/SP8_TEST4/run1', <dict>)
  3. In both yesterday’s log and today’s, it systematically refers to ~ehyer/SUITE/runN, though the actual directory is ~ehyer/cylc-run/SUITE/runN, but I assume that’s just a quirk of the logging.

OK, it seems your problem is the second scenario that I described, and which @oliver.sanders provided more detail on.

And the offending client command is issued by the UI Server.

So for one thing, this should stop happening if you kill your UI Server next time and just interact with your workflows from the command line. (Or, if it also happens with command line clients, you’ll see the “removed contact file” message very clearly in your terminal.)

These cylc play invocations do not show up in the log, which seems very strange

The CLI doesn’t go via the UI Server, so you won’t see cylc play invocations in the UI Server log.

I think that I am on a different login node,

It might be worth keeping track of this for future reference.

The way this works is:

  • the client command gets workflow contact details (host, port, PID) from the contact file
  • it attempts to connect to host:port, to perform the requested operation
  • if the connection times out, it ssh’s to the host to see if the PID is still in the process table
  • if it isn’t, we conclude the workflow is no longer running, and delete the orphaned contact file to allow restart and avoid waiting for timeout again

If the client is not able to ssh to the run host, it won’t delete the contact file because it can’t determine whether or not the workflow is running.

There must be something going wrong with this process, perhaps depending on which node you’re running the UI Server on.

Perhaps there’s something about your network such that the connection attempt and PID check is done on the wrong host, so it looks as if the workflow is not running even though it is (on another host).

Are you perhaps running Cylc on a group of HPC login nodes with a DNS alias that switches over sometimes, such that the original host name ends up pointing at a different node?
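
An easy first check on each login node is whether the node’s own name resolves consistently, e.g.:

$ hostname -f
$ getent hosts "$(hostname -f)"

If different login nodes report the same name but different addresses, that would fit this picture.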


That’s actually the full-format workflow ID (see cylc help id) rather than the file path.

The full format is ~user/workflow-id, where the workflow-id part corresponds to the path under ~/cylc-run (so ~ehyer/SP8_CYCLING_TEST/run8 is the ID of the workflow at ~/cylc-run/SP8_CYCLING_TEST/run8).

We might need to consider changing how that’s logged … full-format workflow IDs looks like wrong path · Issue #688 · cylc/cylc-uiserver · GitHub

I consider this likely. It is intended to be transparent to DNS, but clearly it is only sometimes transparent.

Thanks very much for all your patient assistance with this problem.

–Edward H.

If this does turn out to be a DNS issue, it might be worth trying out global.cylc[scheduler][host self-identification]method = address (if not already). This switches Cylc from using host names (which might not be as unique or stable as we might want them to be) to IP addresses (which one would hope are both stable and unique).

You can test this in user space (without editing the site configuration) by adding it to ~/.cylc/flow/global.cylc.
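
For example:

# ~/.cylc/flow/global.cylc
[scheduler]
    [[host self-identification]]
        method = address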