`CylcError: Cannot determine whether workflow is running on` even though the ssh connection works fine

Following the wonderful help I received here, I was able for some time to launch jobs running on a remote server, orchestrated on my local computer.

However, I tried again a few hours later and couldn’t anymore.
Here’s what now happens:

  • The first job launches successfully on the remote host, and I can read its log properly.
=======================TASK PROLOG================================
Job 766860 submitted on cluster ada from ada.xxx.com
[...]
==================================================================
Workflow : my_flow/run2
Job : 19790101T0000Z/init_batch/01 (try 1)
User@Host: xx@ada

2024-02-17T00:20:24Z INFO - started
2024-02-17T00:20:39Z INFO - succeeded

However, in the UI its state gets stuck on “submitted”. Looking at job.err, I read:

CylcError: Cannot determine whether workflow is running on [local computer address].
CylcError: Cannot determine whether workflow is running on [local computer address].

The only difference I can think of is that I’m now using a wifi connection instead of a wired connection. It is entirely possible that when wired, since my laptop and the server ada were on the same-ish local network, they could communicate in some special way.

However, I can clearly ssh to the server on my current connection, so I fail to see why the workflow is completely stuck.

A previous question from March 2023 describes a similar problem, but it had an explicit “Host key verification failed.” error, which is absent here.

You must not have the right networking set up to get back from the job platform to the scheduler run host.

When a job platform is initialized, the scheduler puts a contact file in the workflow run directory there, which tells clients - including task jobs - how to connect to the scheduler (host, port, PID, …).
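
For reference, the contact file lives in the workflow's `.service` directory on each platform. On a typical Cylc 8 installation it looks roughly like this (an illustrative sketch, not an exhaustive listing; exact field names vary by version):

    # ~/cylc-run/<workflow-id>/.service/contact
    CYLC_VERSION=8.x
    CYLC_WORKFLOW_HOST=<scheduler run host>
    CYLC_WORKFLOW_PORT=<scheduler port>
    CYLC_WORKFLOW_PID=<scheduler process ID>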

By default, job status messages (e.g. job started, and job succeeded or failed) are sent back from jobs via TCP messages.

If that fails, the client code tries to determine whether or not the scheduler is still running, by running a command on the scheduler run host via non-interactive ssh.

If that fails, it reports that it can’t even determine whether or not the scheduler is running.

Your options are:

  • get the networking configuration changed to allow TCP and ssh back from the job platform
  • or configure Cylc to use one-way job polling to track job status, which only requires non-interactive ssh from the scheduler host to the job platform. That works, but job status doesn’t get updated until the next poll time (see the sketch just below).
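
For the second option, here is a minimal sketch of the polling setup in `global.cylc` (the platform name `ada` and the interval values are placeholders; check the platform configuration section of the docs for the items supported by your Cylc version):

    # global.cylc (sketch)
    [platforms]
        [[ada]]
            # track job status by polling from the scheduler, instead of relying
            # on TCP or ssh messages coming back from the job platform
            communication method = poll
            # how often to poll while a job is submitted / running
            submission polling intervals = PT1M
            execution polling intervals = PT1M, PT5M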

Thanks for your answer, as always :slight_smile:. I was able to replicate my issue, and can now confirm that it works fine on the lab’s Ethernet but not on the wifi.

By default, job status messages (e.g. job started, and job succeeded or failed) are sent back from jobs via TCP messages.

If that fails, the client code tries to determine whether or not the scheduler is still running, by running a command on the scheduler run host via non-interactive ssh.

Where can I see the logs of the communication attempts and the returned errors? I checked in [...]/logs/scheduler but there’s barely anything relevant there.

I found the problem!!

As indicated in the host self-identification section of the docs, the address of the workflow host is inferred from the name of the local host by default. While this often resolves to the IP of the host, many university laptops are configured to point to a local domain name, e.g. mycomputer.myuniversity.com. This gets resolved on the local university network, but nowhere else.

The solution is to set `global.cylc[scheduler][host self-identification]method = address`.
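
In `global.cylc` that setting looks like this (the comment is mine):

    # global.cylc
    [scheduler]
        [[host self-identification]]
            # advertise the scheduler host by IP address rather than by hostname
            method = address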

I think this issue is likely to be encountered by other people in the future, and I think we should at least add a link to host self-identification in the doc page that explains remote job launching.

If you link the page you are talking about, I can add that in.

I think it should be either the Tracking Job Status page or the Platform Configuration page of the Cylc 8.2.3 documentation (with a preference for the latter).

Thanks!

This isn’t really anything to do with either remote job submission / tracking or platform configuration.

This is an installation issue for sites that have dynamic host names that can cause a wide range of issues within Cylc (or other distributed systems).

The `method = address` solution should work, presuming the IP addresses are not also dynamic. If the IP addresses are dynamic, you’ll have to use `method = hardwired`.
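
For completeness, a sketch of the hardwired variant (the value of `host` is a placeholder; it needs to be an address or name that the job platform can actually reach):

    # global.cylc
    [scheduler]
        [[host self-identification]]
            method = hardwired
            # fixed address or name that jobs will use to contact the scheduler
            host = 192.0.2.10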

It’s not a good idea to run Cylc workflows on transient machines (i.e. machines that could drop off the network or have unstable DNS) as workflow execution could become disrupted at any time.

You should look at setting up Cylc servers for the workflows to run on. This way, users start, stop, monitor and control workflows from their own machines, but the workflows themselves run on hosts that reside on the network permanently (with fixed DNS). Cylc has built-in support for running workflows on a configured server or pool of servers; see this section for details:

https://cylc.github.io/cylc-doc/stable/html/user-guide/writing-workflows/scheduler.html#submitting-workflows-to-a-pool-of-hosts
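
The relevant `global.cylc` section is roughly as follows (host names are placeholders):

    # global.cylc
    [scheduler]
        [[run hosts]]
            # cylc play will start the scheduler on one of these hosts
            available = server1, server2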

We don’t have dynamic DNS at our site, but we still configure workflows to run on Cylc servers because it’s easier to manage workflows this way, e.g. if a user restarts their machine, their workflows aren’t killed.

We already have Cylc set up that way on the cluster :wink:

My main use case for running workflows from my local machine is to easily test quick workflows without having to port-forward the GUI and upload the workflow files each time.

Ah, good.

Dynamic DNS is not something that Cylc supports. Running the workflow locally is fine, but if it starts running jobs within the remote network, things will start going wrong. I think using host self-identification as you are doing will probably work ok, but this isn’t a pattern we test for or document.

The scheduler can’t log job communication errors if the jobs can’t even connect to the scheduler.

You need to look in the job.out and job.err files for the tasks sent to that job platform.
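
That is, look at files like these (using the workflow and task from the log excerpt above):

    ~/cylc-run/my_flow/run2/log/job/19790101T0000Z/init_batch/01/job.out
    ~/cylc-run/my_flow/run2/log/job/19790101T0000Z/init_batch/01/job.err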