I’m trying to validate and reinstall a workflow and I’m getting the following repeatable error:
WARNING - $ ssh -oBatchMode=yes -oConnectTimeout=10 10.150.8.241 env CYLC_VERSION=8.3.6 bash --login -c 'exec "$0" "$@"' cylc psutil # returned 255
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:FX/GwhFpp2mP5Ll1UDvgXdAGe3X58hEtF6dW71DPiF4.
Please contact your system administrator.
Add correct host key in /p/home/reinecke/.ssh/known_hosts to get rid of this message.
Offending ED25519 key in /p/home/reinecke/.ssh/known_hosts:8
Host key for 10.150.8.241 has changed and you have requested strict checking.
Host key verification failed.
ERROR - Cannot determine whether workflow is running on 10.150.8.241.
/p/app/projects/NEPTUNE/spack-stack/spack-stack-dev-20250109/envs/ce-gcc-10.3.0-direct/._view-pys/giuelw7snnxagq7f6dszn6cjdswau5ps/bin/python3
/p/app/projects/NEPTUNE/spack-stack/spack-stack-dev-20250109/envs/ce-gcc-10.3.0-direct/view-pys/bin/cylc vip -n test8
CRITICAL - Cannot tell if the workflow is running
Note, Cylc 8 cannot restart Cylc 7 workflows.
CylcError: Cannot determine whether workflow is running on 10.150.8.241.
/p/app/projects/NEPTUNE/spack-stack/spack-stack-dev-20250109/envs/ce-gcc-10.3.0-direct/._view-pys/giuelw7snnxagq7f6dszn6cjdswau5ps/bin/python3 /p/app/projects/NEPTUNE/spack-stack/spack-stack-dev-20250109/envs/ce-gcc-10.3.0-direct/view-pys/bin/cylc vip -n test8
I have read through the other posts about “Cannot determine whether workflow is running” errors and I don’t think my case is covered there; apologies for the duplication if I missed something.
What I don’t understand is: if I’m re-installing and reloading from the same login node that I initially installed and played from, why is it trying to ssh to itself using the hardwired IP? That IP is only visible from the compute nodes.
Ah, maybe we’ve wrongly assumed a hard-wired scheduler host address can be used for purposes other than job-to-scheduler communication. We’ll have to look at this.
I’ll see if I can contrive a similar situation and suggest a fix or a workaround …
This can happen when the SSH / network setup changes. If you performed this SSH by hand, you would have the opportunity to accept the host key by typing “yes”. Because this SSH is being performed non-interactively, you don’t have that opportunity.
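Assuming the host key change is legitimate (e.g. the node was rebuilt), the stale entry can be cleared with `ssh-keygen -R` before retrying. Here’s a sketch that operates on a scratch known_hosts file so nothing real is modified (the address is the one from this thread):

```shell
# Scratch demo of clearing a stale host key entry; nothing real is touched.
# On the real system, the equivalent one-liner is:
#   ssh-keygen -R 10.150.8.241
tmp=$(mktemp -d)

# Make a throwaway ed25519 key and a scratch known_hosts containing it
ssh-keygen -q -t ed25519 -N '' -f "$tmp/key"
printf '10.150.8.241 %s\n' "$(cut -d' ' -f1-2 "$tmp/key.pub")" > "$tmp/known_hosts"

# Remove the offending entry for the scheduler address
ssh-keygen -R 10.150.8.241 -f "$tmp/known_hosts"

# Confirm the entry is gone
grep -q '10.150.8.241' "$tmp/known_hosts" || echo "stale key removed"
rm -rf "$tmp"
```

After clearing the real entry, one interactive `ssh 10.150.8.241` lets you inspect and accept the new key before Cylc’s non-interactive connections retry.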
You can either perform the relevant SSH connections by hand, answering “yes” as needed, or answer “yes” by default by turning off strict host key checking for Cylc SSH connections: add -o StrictHostKeyChecking=no to the end of the ssh command for the relevant platform(s), e.g.:
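A sketch of the relevant global.cylc platform setting (the platform name is illustrative; the base command shown is the default Cylc 8 ssh invocation, as I understand it):

```
# global.cylc (platform name is illustrative)
[platforms]
    [[hpc_compute]]
        ssh command = ssh -oBatchMode=yes -oConnectTimeout=10 -oStrictHostKeyChecking=no
```

Note that disabling strict host key checking trades away the man-in-the-middle protection the warning exists for, so removing the stale key and re-accepting by hand is the safer option where practical.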
Maybe I misunderstood, but isn’t the ssh attempt itself being made incorrectly in the first place? The scheduler host self-identification address should only be used from job hosts, right?
I’m not convinced that’s the case. I think the host self-identification address goes into the contact file, so it may be used for other things besides job-to-scheduler communication (this failure is the cylc psutil check).
The host self-identification address is only significant on the compute node. The host cannot access itself using that address. So ssh -oBatchMode=yes -oConnectTimeout=10 10.150.8.241 env CYLC_VERSION=8.3.6 bash --login -c 'exec "$0" "$@"' cylc psutil would only ever work from a compute node.
@areinecke - have you found a way to get this working?
I’ve been away on leave, but I’ve now confirmed that a hardwired scheduler host name does end up in the local .service/contact file as well as the remote (job platform) one (and of course they may be the same file anyway, on a shared filesystem), and it does get used for operations that are launched interactively (i.e., not just by running jobs on job platforms), e.g.:
- ssh to the scheduler host, to see if the scheduler is still running
- tcp connections to the scheduler, for commands such as cylc reload
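For reference, the hardwired address is what lands in the contact file’s host field. An abridged, illustrative contact file (field names as I understand the Cylc 8 format; values are placeholders from this thread):

```
# ~/cylc-run/<workflow>/.service/contact (abridged; values illustrative)
CYLC_WORKFLOW_HOST=10.150.8.241
CYLC_WORKFLOW_PORT=43001
CYLC_VERSION=8.3.6
```

Any client that reads this file, local or remote, will try to reach the scheduler at that address.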
This is not surprising now that I think about it, but I suspect that none of the main (developer) sites have had to use this feature for a long time, so we didn’t notice it was kinda broken. I can see how it doesn’t help your case.
If possible, via network configuration, it would be best to avoid having to use hardwired host self-identification.
Beyond that, I’ll put up an Issue on the cylc-flow repository, but it might be a tricky one to fix. Cylc commands could be issued on:
- the scheduler host
- a host on the same local network as the scheduler host (we often have a pool of potential scheduler hosts, and other local “interactive nodes”)
- job platform hosts, which might or might not see the scheduler host at the same address, and might or might not see the same filesystem (and contact file) as the scheduler
Note I have “corrected” that statement. One of the team pointed out that this feature is working as advertised in the docs:
hardwired
(only to be used as a last resort) Manually specified host name or IP address (requires host) of the workflow host.
There’s no suggestion that the host name applies only from job hosts.
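For context, this is the setting under discussion, as it would appear in global.cylc (to the best of my reading of the Cylc 8 schema; the address is the one from this thread):

```
# global.cylc
[scheduler]
    [[host self-identification]]
        method = hardwired
        host = 10.150.8.241
```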
(That said, I think this feature was originally intended specifically to handle identity of the scheduler host as seen from job hosts, for task messaging …)
Agree with this - I have emails dating back 10+ years describing that this is the intended purpose; we just haven’t had to change it because it worked. I imagine that v8 handles this differently than previous versions, which is why it’s failing now.
I’m not sure of a way to get around hardwiring the IP address, because this is how the HPE Crays are configured; however, I’ll reach out to the admins for suggestions. Would it be sufficient to have two contact files, even if you’re on the same filesystem?
Yeah, it’s coming back to me now. My suspicion is Cylc 7 was only using this setting from inside job environments, and that was the original intention. But it’s been so long since the core Cylc sites had to use the setting that the original purpose slipped from our minds and in Cylc 8 we started using it in other contexts too.
Unfortunately this is tricky. Even if we had left this as it was in Cylc 7, it would still be wrong in principle - in modern Cylc this should not be a scheduler setting because there may be multiple job platforms that all see the scheduler host differently, so it should be install target or possibly job platform specific.
We will consider whether or not we should try to restore the old behaviour. In the meantime, your best bet might be:
- get the network settings updated to make the scheduler host visible, as itself, from the job platform (/etc/hosts entry and/or gateway settings? … I’m not very network savvy)
- or do it via your .ssh/config, and configure Cylc to use ssh-based task communication - we think this will work, and it is fully configurable by users
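A sketch of that second route, under the assumption that the scheduler is reachable from the job platform at some other address (the alias, addresses, and platform name are all placeholders):

```
# ~/.ssh/config on the job platform: map the scheduler's hostname to the
# address it is actually reachable at from these nodes
Host login01
    HostName 10.150.9.1

# global.cylc: switch the platform to ssh-based task communication
[platforms]
    [[hpc_compute]]
        communication method = ssh
```

With ssh-based communication, task messages go back over ssh, so the name resolution can be fixed entirely in user-level ssh configuration rather than in the scheduler’s self-identification setting.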
The IP address of the Cylc server cannot be accessed from the compute nodes.
But it can be accessed via some other hostname or IP address from these nodes.
Yes, that is correct.
One piece of good news though, I was running this on two separate systems that I thought both needed to use the method=hardwired option. It turns out that one of them can use method=name, so it’s only an issue half the time now.