Cylc internode communication - when things go wrong

Hello Wizards of Cylc,

I am running Cylc 8.2 on an HPC system. The PBS queues on this particular system have very significant lag. One of my workarounds is to run a script on a compute node that instantiates the cylc workflow and then stays alive as long as the suite is alive. When I do this, all jobs with batch system = background run on the compute node, which keeps me from running afoul of the “login node abuse” policy on the HPC system.
However, the cylc daemon (or perhaps only the cylc GUI?) is still running on the login node and communicates with the compute node.
The problem arises if the PBS job on the compute node dies unexpectedly. When that happens, cylc continues to ping the compute node (which is no longer associated with my job), and that raises a red flag with the system admins.
Q1: My “fix” is to force-delete the workflow dir that is pointing to the dead job: rm -rf ~/cylc-run/SUITE/RUN#/. Obviously I can’t automate this. Is there another way?
Q2: Would you expect this only to be caused by the GUI, so I could simply shut down the GUI when I’m not monitoring it? (Not really ideal, but it takes <10 minutes to start it up so I guess that would be OK)
Q3: Any other suggestions for how to mitigate this behavior? Any settings that determine how long cylc (cylc GUI? or other?) keeps pinging non-responsive nodes?

Thanks for your help,
–Edward H.

Hi @ejhyer,

run a script on a compute node that instantiates the cylc workflow and then stays alive as long as the suite is alive

So you submit a PBS job that runs a Cylc scheduler on the allocated compute node?

(For the keep-alive bit, are you aware of cylc play --no-detach? - the play command won’t exit until the workflow is finished)
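For example, a minimal PBS job script along these lines (the directives and the workflow ID are illustrative placeholders, to be adjusted for your site) would hold the allocation for exactly as long as the scheduler is running:

#!/bin/bash
#PBS -N cylc-scheduler           # illustrative job name
#PBS -l walltime=24:00:00        # illustrative walltime; match your site limits
#PBS -l select=1:ncpus=1
# Run the scheduler in the foreground on the allocated compute node;
# the PBS job ends when the workflow finishes (or hits the walltime).
cylc play --no-detach <workflow-id>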

If so, that is a recipe for ending up with orphaned workflow contact files (i.e. files left on disk that indicate the scheduler is still running, and where it is running): if the workflow runs beyond the PBS walltime limit it will be killed with no chance to clean up after itself.

Normally Cylc commands remove orphaned contact files automatically, after determining that the scheduler is no longer running where the file says it is. However, on one of our platforms the system prevents me from connecting to compute nodes that are no longer allocated to me. In that case, Cylc does not delete the contact file automatically, because it can't safely determine whether the scheduler is still running or not.
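For reference, the check Cylc performs is roughly equivalent to doing the following by hand (the contact file keys are assumed from Cylc 8; the host name and PID shown are made up):

$ cat ~/cylc-run/<workflow-id>/.service/contact
CYLC_WORKFLOW_HOST=nid001234
CYLC_WORKFLOW_PID=12345
...
# Check whether that PID is still alive on that host. On a system that refuses
# connections to nodes you no longer hold, this step fails, so Cylc cannot
# safely remove the contact file.
$ ssh nid001234 ps -p 12345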

Is that what’s happening to you? It might help to see the actual error messages, or what exactly the “ping” process is that your admins are complaining about.

So you submit a PBS job that runs a Cylc scheduler on the allocated compute node?

Yes that’s what I am doing.

cylc play --no-detach

That’s a gem! I was not aware!

the system prevents me from connecting to compute nodes that are no longer allocated to me. In that case, Cylc does not delete the contact file automatically because it can't safely determine whether the scheduler is still running or not.

I think this is exactly what is happening. I’ll try to get you some excerpts from the cylc gui logging that support that, once I get them to let me back in.

Q: Does this contact behavior change depending on whether or not the Cylc GUI is running?

The behaviour is the same for the UI Server and CLI tools (verify that a workflow is still running, and delete its contact file if not), but I think the UI Server may repeatedly attempt that if it can’t connect to the supposed workflow host.

If you don’t have a UI Server running, there is no other process that will do this, unless you run a scheduler-connecting command. The only daemon processes in Cylc are the schedulers.
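For example, running a scheduler-connecting command such as cylc ping against the workflow triggers that same verification (the workflow ID is a placeholder):

$ cylc ping <workflow-id>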

This sounds pretty clear, but I want to be certain I understand it right:

  Q1. if cylc has a contact file pointing to a remote node, it will attempt to contact that node ONLY:
    a) if specifically requested by a CLI command;
    b) if the cylc UI server (cylc gui or some other form) is running.

Q2. Are there other ways to let cylc know that the workflow is inactive, apart from rm -rf cylc-run/SUITE/RUN#?

Q1: that’s right. There’s literally nothing else running that could do it.

Q2: you don’t have to delete the entire run directory, just the contact file that (normally) indicates the scheduler is still running (and which the scheduler itself removes on normal shutdown, but can’t if it suffers a hard kill).

$ rm ~/cylc-run/<workflow-id>/.service/contact