I’m trying to validate and reinstall a workflow and I’m getting the following repeatable error:
WARNING - $ ssh -oBatchMode=yes -oConnectTimeout=10 10.150.8.241 env CYLC_VERSION=8.3.6 bash --login -c 'exec "$0" "$@"' cylc psutil # returned 255
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:FX/GwhFpp2mP5Ll1UDvgXdAGe3X58hEtF6dW71DPiF4.
Please contact your system administrator.
Add correct host key in /p/home/reinecke/.ssh/known_hosts to get rid of this message.
Offending ED25519 key in /p/home/reinecke/.ssh/known_hosts:8
Host key for 10.150.8.241 has changed and you have requested strict checking.
Host key verification failed.
ERROR - Cannot determine whether workflow is running on 10.150.8.241.
/p/app/projects/NEPTUNE/spack-stack/spack-stack-dev-20250109/envs/ce-gcc-10.3.0-direct/._view-pys/giuelw7snnxagq7f6dszn6cjdswau5ps/bin/python3
/p/app/projects/NEPTUNE/spack-stack/spack-stack-dev-20250109/envs/ce-gcc-10.3.0-direct/view-pys/bin/cylc vip -n test8
CRITICAL - Cannot tell if the workflow is running
Note, Cylc 8 cannot restart Cylc 7 workflows.
CylcError: Cannot determine whether workflow is running on 10.150.8.241.
/p/app/projects/NEPTUNE/spack-stack/spack-stack-dev-20250109/envs/ce-gcc-10.3.0-direct/._view-pys/giuelw7snnxagq7f6dszn6cjdswau5ps/bin/python3 /p/app/projects/NEPTUNE/spack-stack/spack-stack-dev-20250109/envs/ce-gcc-10.3.0-direct/view-pys/bin/cylc vip -n test8
I have read through the other posts about “Cannot determine whether workflow is running” errors and I don’t think my case is covered there; apologies for the duplication if I missed something.
What I don’t understand is: if I’m re-installing and reloading from the same login node that I initially installed and played from, why is it trying to ssh to itself using the hardwired IP? That IP is only visible from the compute nodes.
Ah, maybe we’ve wrongly assumed a hard-wired scheduler host address can be used for purposes other than job-to-scheduler communication. We’ll have to look at this.
I’ll see if I can contrive a similar situation and suggest a fix or a workaround …
This can happen when the SSH / network setup changes. If you performed this SSH by hand, you would have the opportunity to accept the host key by typing “yes”. Because this SSH is being performed non-interactively, you don’t have that opportunity.
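Assuming the host key change is legitimate (e.g. the node was rebuilt), the stale entry can be cleared with `ssh-keygen -R` before retrying. Here’s a sketch that operates on a scratch known_hosts file so nothing real is modified (the address is the one from this thread):

```shell
# Scratch demo of clearing a stale host key entry; nothing real is touched.
# On the real system, the equivalent one-liner is:
#   ssh-keygen -R 10.150.8.241
tmp=$(mktemp -d)

# Make a throwaway ed25519 key and a scratch known_hosts containing it
ssh-keygen -q -t ed25519 -N '' -f "$tmp/key"
printf '10.150.8.241 %s\n' "$(cut -d' ' -f1-2 "$tmp/key.pub")" > "$tmp/known_hosts"

# Remove the offending entry for the scheduler address
ssh-keygen -R 10.150.8.241 -f "$tmp/known_hosts"

# Confirm the entry is gone
grep -q '10.150.8.241' "$tmp/known_hosts" || echo "stale key removed"
rm -rf "$tmp"
```

After clearing the real entry, one interactive `ssh 10.150.8.241` lets you inspect and accept the new key before Cylc’s non-interactive connections retry.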
You can either perform the relevant SSH connections by hand, answering “yes” as needed, or answer “yes” by default by turning off strict host key checking for Cylc SSH connections: add -o StrictHostKeyChecking=no to the end of the ssh command for the relevant platform(s), e.g.:
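A sketch of the relevant global.cylc platform setting (the platform name is illustrative; the base command shown is the default Cylc 8 ssh invocation, as I understand it):

```
# global.cylc (platform name is illustrative)
[platforms]
    [[hpc_compute]]
        ssh command = ssh -oBatchMode=yes -oConnectTimeout=10 -oStrictHostKeyChecking=no
```

Note that disabling strict host key checking trades away the man-in-the-middle protection the warning exists for, so removing the stale key and re-accepting by hand is the safer option where practical.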
Maybe I misunderstood, but isn’t the ssh attempt itself being made incorrectly in the first place? The scheduler host self-identification address should only be used from job hosts, right?
I’m not convinced that’s the case. I think the host self-identification address goes into the contact file, so it may be used for other things besides job-to-scheduler communication (this failure is the cylc psutil check).
The host self-identification address is only significant on the compute node. The host cannot access itself using that address. So ssh -oBatchMode=yes -oConnectTimeout=10 10.150.8.241 env CYLC_VERSION=8.3.6 bash --login -c 'exec "$0" "$@"' cylc psutil would only ever work from a compute node.
@areinecke - have you found a way to get this working?
I’ve been away on leave, but I’ve now confirmed that a hardwired scheduler host name does end up in the local .service/contact file as well as the remote (job platform) one (and of course they may be the same file anyway, on a shared filesystem), and it does get used for operations that are launched interactively (i.e., not just by running jobs on job platforms), e.g.:
- ssh to the scheduler host, to see if the scheduler is still running
- tcp connections to the scheduler, for commands such as cylc reload
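For reference, the hardwired address is what lands in the contact file’s host field. An abridged, illustrative contact file (field names as I understand the Cylc 8 format; values are placeholders from this thread):

```
# ~/cylc-run/<workflow>/.service/contact (abridged; values illustrative)
CYLC_WORKFLOW_HOST=10.150.8.241
CYLC_WORKFLOW_PORT=43001
CYLC_VERSION=8.3.6
```

Any client that reads this file, local or remote, will try to reach the scheduler at that address.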
This is not surprising now that I think about it, but I suspect that none of the main (developer) sites have had to use this feature for a long time, so we didn’t notice it was kinda broken. I can see how it doesn’t help your case.
If possible, via network configuration, it would be best to avoid having to use hardwired host self-identification.
Beyond that, I’ll put up an Issue on the cylc-flow repository, but it might be a tricky one to fix. Cylc commands could be issued on:
- the scheduler host
- a host on the same local network as the scheduler host (we often have a pool of potential scheduler hosts, and other local “interactive nodes”)
- job platform hosts, which might or might not see the scheduler host at the same address, and might or might not see the same filesystem (and contact file) as the scheduler
Note I have “corrected” that statement. One of the team pointed out that this feature is working as advertised in the docs:
hardwired
(only to be used as a last resort) Manually specified host name or IP address (requires host) of the workflow host.
There’s no suggestion that the host name applies only from job hosts.
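For context, this is the setting under discussion, as it would appear in global.cylc (to the best of my reading of the Cylc 8 schema; the address is the one from this thread):

```
# global.cylc
[scheduler]
    [[host self-identification]]
        method = hardwired
        host = 10.150.8.241
```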
(That said, I think this feature was originally intended specifically to handle identity of the scheduler host as seen from job hosts, for task messaging …)
Agree with this - I have emails dating back 10+ years describing that this is the intended purpose; we just haven’t had to change it because it worked. I imagine that v8 handles this differently than previous versions, which is why it’s failing now.
I’m not sure of a way to get around hardwiring the IP address, because this is how the HPE Crays are configured; however, I’ll reach out to the admins for suggestions. Would it be sufficient to have two contact files, even if you’re on the same filesystem?
Yeah, it’s coming back to me now. My suspicion is Cylc 7 was only using this setting from inside job environments, and that was the original intention. But it’s been so long since the core Cylc sites had to use the setting that the original purpose slipped from our minds and in Cylc 8 we started using it in other contexts too.
Unfortunately this is tricky. Even if we had left this as it was in Cylc 7, it would still be wrong in principle - in modern Cylc this should not be a scheduler setting because there may be multiple job platforms that all see the scheduler host differently, so it should be install target or possibly job platform specific.
We will consider whether or not we should try to restore the old behaviour. In the meantime, your best bet might be:
- get the network settings updated to make the scheduler host visible, as itself, from the job platform (/etc/hosts entry and/or gateway settings? … I’m not very network savvy)
- or do it via your .ssh/config, and configure Cylc to use ssh-based task communication - we think this will work, and it is fully configurable by users
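A sketch of that second route, under the assumption that the scheduler is reachable from the job platform at some other address (the alias, addresses, and platform name are all placeholders):

```
# ~/.ssh/config on the job platform: map the scheduler's hostname to the
# address it is actually reachable at from these nodes
Host login01
    HostName 10.150.9.1

# global.cylc: switch the platform to ssh-based task communication
[platforms]
    [[hpc_compute]]
        communication method = ssh
```

With ssh-based communication, task messages go back over ssh, so the name resolution can be fixed entirely in user-level ssh configuration rather than in the scheduler’s self-identification setting.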
The IP address of the Cylc server cannot be accessed from the compute nodes.
But it can be accessed via some other hostname or IP address from these nodes.
Yes, that is correct.
One piece of good news though, I was running this on two separate systems that I thought both needed to use the method=hardwired option. It turns out that one of them can use method=name, so it’s only an issue half the time now.