`CylcError: Cannot determine whether workflow is running on` even though the ssh connection works fine

Following the wonderful help I received here, I was able for some time to launch jobs running on a remote server, orchestrated on my local computer.

However, I tried again a few hours later and couldn’t anymore.
Here’s what now happens:

  • The first job launches successfully on the remote host, and I can read its log properly.
=======================TASK PROLOG================================
Job 766860 submitted on cluster ada from ada.xxx.com
[...]
==================================================================
Workflow : my_flow/run2
Job : 19790101T0000Z/init_batch/01 (try 1)
User@Host: xx@ada

2024-02-17T00:20:24Z INFO - started
2024-02-17T00:20:39Z INFO - succeeded

However, in the UI its state gets stuck on “submitted”. Looking at job.err, I read:

CylcError: Cannot determine whether workflow is running on [local computer address].
CylcError: Cannot determine whether workflow is running on [local computer address].

The only difference I can think of is that I’m now using a wifi connection instead of a wired connection. It is entirely possible that when wired, since my laptop and the server ada were on the same-ish local network, they could communicate in some special way.

However, I can clearly ssh to the server on my current connection, so I fail to see why the workflow is completely stuck.

A previous question from March 2023 describes a similar problem, but it had an explicit “Host key verification failed.” error, which is absent here.

You must not have the right networking set up to get back from the job platform to the scheduler run host.

When a job platform is initialized, the scheduler puts a contact file in the workflow run directory there, which tells clients - including task jobs - how to connect to the scheduler (host, port, PID, …).
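
For reference, the contact file lives in the workflow's `.service` directory on each platform. On a typical Cylc 8 installation it looks roughly like this (an illustrative sketch, not an exhaustive listing; exact field names vary by version):

    # ~/cylc-run/<workflow-id>/.service/contact
    CYLC_VERSION=8.x
    CYLC_WORKFLOW_HOST=<scheduler run host>
    CYLC_WORKFLOW_PORT=<scheduler port>
    CYLC_WORKFLOW_PID=<scheduler process ID>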

By default, job status messages (e.g. job started, and job succeeded or failed) are sent back from jobs via TCP messages.

If that fails, the client code tries to determine whether or not the scheduler is still running, by running a command on the scheduler run host via non-interactive ssh.

If that fails, it reports that it can’t even determine whether or not the scheduler is running.

Your options are:

  • get the networking configuration changed to allow TCP and ssh back from the job platform
  • or configure Cylc to use one-way job polling to track job status, which only requires non-interactive ssh from the scheduler host to the job platform. That works, but job status doesn’t get updated until the next poll time (see the sketch just below).
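
For the second option, here is a minimal sketch of the polling setup in `global.cylc` (the platform name `ada` and the interval values are placeholders; check the platform configuration section of the docs for the items supported by your Cylc version):

    # global.cylc (sketch)
    [platforms]
        [[ada]]
            # track job status by polling from the scheduler, instead of relying
            # on TCP or ssh messages coming back from the job platform
            communication method = poll
            # how often to poll while a job is submitted / running
            submission polling intervals = PT1M
            execution polling intervals = PT1M, PT5M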

Thanks for your answer, as always :slight_smile:. I was able to replicate my issue, and can now confirm that it works fine on the lab’s Ethernet but not on the wifi.

By default, job status messages (e.g. job started, and job succeeded or failed) are sent back from jobs via TCP messages.

If that fails, the client code tries to determine whether or not the scheduler is still running, by running a command on the scheduler run host via non-interactive ssh.

Where can I see the logs of the communication attempts and the returned errors? I checked in [...]/logs/scheduler but there’s barely anything relevant there.

I found the problem!!

As indicated in the host self-identification section of the docs, the address of the workflow host is inferred from the name of the local host by default. While this often resolves to the IP of the host, many university laptops are configured to point to a local domain name, e.g. mycomputer.myuniversity.com. This gets resolved on the local university network, but nowhere else.

The solution is to set `global.cylc[scheduler][host self-identification]method = address`.
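
In `global.cylc` that setting looks like this (the comment is mine):

    # global.cylc
    [scheduler]
        [[host self-identification]]
            # advertise the scheduler host by IP address rather than by hostname
            method = address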

I think this issue is likely to be encountered by other people in the future, and I think we should at least add a link to host self-identification in the doc page that explains remote job launching.

If you link the page you are talking about, I can add that in.

I think it should be either the Tracking Job Status page or the Platform Configuration page of the Cylc 8.2.3 documentation (with a preference for the latter).

Thanks!

This isn’t really anything to do with either remote job submission / tracking or platform configuration.

This is an installation issue for sites that have dynamic host names that can cause a wide range of issues within Cylc (or other distributed systems).

The `method = address` solution should work, presuming the IP addresses are not also dynamic. If the IP addresses are dynamic, you’ll have to use `method = hardwired`.
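
For completeness, a sketch of the hardwired variant (the value of `host` is a placeholder; it needs to be an address or name that the job platform can actually reach):

    # global.cylc
    [scheduler]
        [[host self-identification]]
            method = hardwired
            # fixed address or name that jobs will use to contact the scheduler
            host = 192.0.2.10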

It’s not a good idea to run Cylc workflows on transient machines (i.e. machines that could drop off the network or have unstable DNS) as workflow execution could become disrupted at any time.

You should look at setting up Cylc servers for the workflows to run on. This way, users start, stop, monitor and control workflows from their own machines, but the workflows themselves run on hosts that reside on the network permanently (with fixed DNS). Cylc has built-in support for running workflows on a configured server or pool of servers; see this section for details:

https://cylc.github.io/cylc-doc/stable/html/user-guide/writing-workflows/scheduler.html#submitting-workflows-to-a-pool-of-hosts
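
The relevant `global.cylc` section is roughly as follows (host names are placeholders):

    # global.cylc
    [scheduler]
        [[run hosts]]
            # cylc play will start the scheduler on one of these hosts
            available = server1, server2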

We don’t have dynamic DNS at our site, but we still configure workflows to run on Cylc servers because it’s easier to manage workflows this way, e.g. if a user restarts their machine, their workflows aren’t killed.

We already have Cylc set up that way on the cluster :wink:

My main use case for running workflows from my local machine is to easily test quick workflows without having to port-forward the GUI and upload the workflow files each time.

Ah, good.

Dynamic DNS is not something that Cylc supports. Running the workflow locally is fine, but if it starts running jobs within the remote network, things will start going wrong. I think using host self-identification as you are doing will probably work ok, but this isn’t a pattern we test for or document.

The scheduler can’t log job communication errors if the jobs can’t even connect to the scheduler.

You need to look in the job.out and job.err files for the tasks sent to that job platform.
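
That is, look at files like these (using the workflow and task from the log excerpt above):

    ~/cylc-run/my_flow/run2/log/job/19790101T0000Z/init_batch/01/job.out
    ~/cylc-run/my_flow/run2/log/job/19790101T0000Z/init_batch/01/job.err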