More interesting contact file behavior: GUI/daemon interactions?

cylc-flow                 8.4.2              pyh3a29b38_0    conda-forge
cylc-flow-base            8.4.2              pyh707e725_0    conda-forge
cylc-uiserver             1.6.1              pyhd8ed1ab_0    conda-forge
cylc-uiserver-base        1.6.1              pyh707e725_0    conda-forge

I left a suite running, shut down the GUI, and logged out. Later I logged back in and restarted the GUI to check on the suite. I found this interesting sequence in the log files, which I am arranging by timestamp here:
in log/scheduler/07-restart-05.log:

2025-06-10T17:15:35Z INFO - [20190317T1800Z/stage_naaps_conc/01:preparing] submitted to localhost:background[4113513]
2025-06-10T17:15:35Z INFO - [20190317T1800Z/stage_naaps_conc/01:preparing] => submitted

These are routine messages indicating nominal behavior.
When the GUI loaded, the suite was marked Stopped, so I pressed play. Cylc switched to a new log and wrote this:
in log/scheduler/08-restart-06.log:

2025-06-10T17:15:48Z INFO - Workflow: SP8_CYCLING/run7
2025-06-10T17:15:48Z INFO - LOADING workflow parameters

You can see that the suite stopped, but it clearly stopped because of my activity: it showed as stopped in the GUI, but it had been in that state for less than 15 seconds, which tells me that firing up the GUI caused the change of state. Cylc actually wrote a traceback for this, but put it at the end of the earlier log, 07-restart-05.log:


2025-06-10T17:22:38Z CRITICAL - An uncaught error caused Cylc to shut down.
    If you think this was an issue in Cylc, please report the following traceback to the developers.
    https://github.com/cylc/cylc-flow/issues/new?assignees=&labels=bug&template=bug.md&title=;
2025-06-10T17:22:38Z ERROR - contact file modified
    Traceback (most recent call last):
      File "/p/home/ehyer/src/conda/miniforge3/envs/cylc8-ui/lib/python3.9/site-packages/cylc/flow/scheduler.py", line 710, in run_scheduler
        await self._main_loop()
      File "/p/home/ehyer/src/conda/miniforge3/envs/cylc8-ui/lib/python3.9/site-packages/cylc/flow/scheduler.py", line 1869, in _main_loop
        await asyncio.gather(
      File "/p/home/ehyer/src/conda/miniforge3/envs/cylc8-ui/lib/python3.9/site-packages/cylc/flow/main_loop/__init__.py", line 195, in _wrapper
        raise MainLoopPluginException(exc) from None
    cylc.flow.main_loop.MainLoopPluginException: contact file modified
2025-06-10T17:22:38Z CRITICAL - Workflow shutting down - contact file modified
2025-06-10T17:22:39Z INFO - DONE

My best guess is that this is somehow related to the contact file problem, and potentially specific to my HPC system, but it appears to be an unexpected interaction between the GUI and the daemon, so it’s concerning.

Hi @ejhyer

The timestamps seem to show the workflow shut down (due to the modified contact file) AFTER you restarted a new instance of the same run from the GUI - which should not be possible.

  • restart05 was running (one instance of the scheduler program)
  • you opened a GUI, which said that the workflow was stopped (it wasn’t)
    • so you pressed “play” to start the workflow
    • which started restart06 (under another instance of the scheduler)
    • the restart06 run modified the contact file in the run directory (as it should)
    • or the GUI removed the contact file after confirming (wrongly) that the workflow was not running
  • restart05 then noticed that something had modified its contact file, so it shut down (as it should).

I think this means that your UI Server (cylc gui) could not see the workflow, so it incorrectly concluded that it had stopped. You then started a new instance of the same run, which ended up fighting with the first instance over the same run directory.

Possible causes:

  • filesystem issues that temporarily made the workflow run directory invisible
    • unlikely?
  • running your workflow on an HPC login node with a DNS-aliased name (e.g. “hpc-login” points to one or other of two actual nodes, “hpc-login01” or “hpc-login02”): if the alias switches over to the other node, it will suddenly look as if your workflow is not running on “hpc-login” (see the sketch after this list)
    • likely, as this came up already on your other thread on this forum?
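
For what it’s worth, here is a minimal sketch in plain Python (not Cylc code) of what that alias switch looks like from a process’s point of view - the host names are the hypothetical ones above:

    import socket

    alias = "hpc-login"               # generic alias recorded when the scheduler started
    this_node = socket.gethostname()  # e.g. "hpc-login02" after the alias switches over

    try:
        alias_ip = socket.gethostbyname(alias)
        local_ip = socket.gethostbyname(this_node)
    except socket.gaierror as exc:
        raise SystemExit(f"name resolution failed: {exc}")

    if alias_ip != local_ip:
        # The alias now points at a different physical node, so anything that
        # recorded "hpc-login" as its host can no longer be reached from here.
        print(f"{alias} resolves to {alias_ip}, but this node is {this_node} ({local_ip})")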

Note that no part of Cylc (including the GUI) other than the scheduler that owns it should ever modify a workflow contact file.

The only exception to this is that Cylc commands will remove an orphaned contact file if:

  • attempting to connect to the scheduler at host:port (as listed in the contact file) times out
  • and ssh to check scheduler PID on the host (as listed in the contact file) shows the scheduler process is no longer running

This indicates that a scheduler got killed without cleaning up its contact file. The check should be foolproof - but we do (naturally!) assume that the host name doesn’t magically point to a different actual host than when the scheduler started up.
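
To illustrate, here is a rough sketch of that two-step check in plain Python. It is a paraphrase of the logic described above, not Cylc’s actual implementation (the real check records and compares more detail than a bare PID):

    import socket
    import subprocess

    def contact_file_is_orphaned(host: str, port: int, pid: int, timeout: float = 10.0) -> bool:
        """Return True only if the scheduler looks dead on both counts."""
        # 1. Does anything answer at host:port, as listed in the contact file?
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return False  # something is listening - assume the scheduler is alive
        except OSError:
            pass  # connection refused or timed out - fall through to the PID check
        # 2. Over ssh, is the recorded scheduler PID still running on that host?
        result = subprocess.run(["ssh", host, "ps", "-p", str(pid)], capture_output=True)
        return result.returncode != 0  # no such process, so the contact file is orphaned

The weak link is the same assumption noted above: that the recorded host name still points at the machine the scheduler actually started on.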

I think this means that your UI Server (cylc gui) could not see the workflow, so it incorrectly concluded that it had stopped.

It’s a long shot, but it’s possible that there might be some evidence of this in one of these log files: ~/.cylc/uiserver/*.

This is definitely related to the use of multiple login nodes on the system. That setup is intended to operate “invisibly”, but it trips up the SSH calls that Cylc needs to make.

Thanks very much for your patient assistance with this one. There’s probably no mitigation except to force my login to a specific node.

Ideally, you would not have to run workflows on HPC login nodes.

Possibilities include:

  • other interactive nodes allocated to specific groups or projects (as opposed to general purpose login nodes)
  • a dedicated small pool of Cylc run hosts on the HPC (configure Cylc to use them automatically; see the sketch after this list)
  • off-site Cylc hosts that are able to use the HPC as just a job platform
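
For the run-host-pool option, the configuration would look roughly like this in global.cylc (the file location and host names here are assumptions for illustration; check the global.cylc reference for your Cylc version):

    # ~/.cylc/flow/global.cylc (or the site config) - hypothetical host names
    [scheduler]
        [[run hosts]]
            # cylc play picks one of these hosts to start the scheduler on
            available = cylc-run01, cylc-run02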

However, that sort of support is typically only available at a site where Cylc is officially and heavily used.

Short of that, yes, I think you’ll need to target a specific login node by name, not by the alias that could point to any one of several nodes.

Did you try global.cylc[scheduler][host self-identification]method = address (suggested in the other thread by @oliver.sanders)? Did that not solve the issue?

Did you try global.cylc[scheduler][host self-identification]method = address

I did; that did not solve the issue. I can add some further description:

  • The issue definitely pertains to name-sharing of multiple login nodes.
  • Suites started on login01 cannot be controlled from login02 etc.
  • cylc scan will show these suites, accurately, as active
  • cylc scan -t rich or any attempt to interact via TUI will fail.
  • cylc GUI will show the suite as Stopped.
  • If you attempt to interact from the wrong node, cylc will delete the contact file, and the suite will need to be restarted.
  • Just starting up the GUI will not cause this. What I found earlier was that, when the GUI (incorrectly) showed the suite as Stopped, I clicked Play, and the result was a deleted contact file.

I don’t have any firm recommendations for changing Cylc’s behavior, except that perhaps the GUI could show something other than “Stopped” in these cases.
As always, I’m very appreciative of everyone’s thoughtful contributions!

Thanks for the update @ejhyer

OK, I think it all makes sense now, as an (in retrospect, at least) inevitable consequence of the “name-shared” multiple login nodes, given that Cylc assumes the scheduler host name recorded at start-up will always point at the actual run host.

(One question: when actually logged into a login node, does it self-identify - e.g. via the hostname command - as the generic name, or as the “real” host name?)
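
A quick, non-authoritative way to check, alongside the hostname command itself, is from Python on the node in question:

    import socket

    print(socket.gethostname())  # roughly what the hostname command reports
    print(socket.getfqdn())      # the fully qualified name DNS hands back for this host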

I’ll put this down for discussion in the upcoming Cylc project meeting, in case there’s anything we can do to make it easier for users stuck with your scenario.

As a workaround, I think you could use global.cylc[scheduler][host self-identification]method = hardwired, although you’d have to choose one particular login node and stick with it.
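
Written out as a config snippet, that would be something like the following - the file path and the host name are assumptions (substitute whichever login node you settle on):

    # ~/.cylc/flow/global.cylc - "hpc-login01" is a placeholder
    [scheduler]
        [[host self-identification]]
            method = hardwired
            host = hpc-login01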

@ejhyer - actually, it seems to us (on the Cylc team) that the address method should work for your scenario.

Can you give more information about why it evidently doesn’t? With that global configuration, the scheduler’s IP address rather than its host name should appear in the contact file, ~/cylc-run/<workflow-id>/.service/contact, and that IP address ought to (I think) point to the actual host node.
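
For reference, the setting under discussion written out as a global.cylc snippet (the user config location is an assumption):

    # ~/.cylc/flow/global.cylc
    [scheduler]
        [[host self-identification]]
            method = address   # record the scheduler host's IP address instead of its name

After restarting the workflow with that in place, inspecting ~/cylc-run/<workflow-id>/.service/contact should show whether an IP address or a host name was actually recorded.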