Restarting failed workflow

The hostname command just prints the name of the host, e.g.:

$ ssh foo hostname
foo
$ echo $?
0
$ ssh foo hostname -I
1.2.3.4
$ echo $?
0

Ah, so it should not actually have tried to connect, just print the hostname, when I ran that command? In which case the issue is just that “BatchMode=yes” leads to the permission denied error.

(ssh foo hostname should SSH to the host called “foo” and run the hostname command on that host)

I agree!

@bencash - the problem occurs when cylc tries to connect to the recorded run host to see if the scheduler is still running there.

If you have not configured any run hosts, cylc play will start schedulers on localhost. Later if you try to restart the workflow, cylc play will try to connect to the recorded run host to see if the scheduler is still running. If you do that from the original host, the resulting ssh connection will be to localhost.
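
For reference, a pool of run hosts is configured in the global config, something like this (the host names here are just placeholders):

[scheduler]
    [[run hosts]]
        available = login1, login2

With nothing set there, cylc play simply starts the scheduler on the node you ran it from.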

(I have a feeling it’s next to impossible to be sure if a target host is actually the same as localhost without going there, so Cylc probably treats all hosts equally for this purpose). [UPDATE: see below, I guess this is not true!]

If you configure ssh for non-interactive use on a cluster it should automatically work between all nodes, including localhost-to-localhost, because .ssh/ is on the shared filesystem.
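
(For example, the usual one-off setup where $HOME - and therefore ~/.ssh - is shared across nodes is roughly:

ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

after which ssh <other-node> hostname should work from any node without prompting, including back to the same node - you may still need to accept host keys the first time.)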

However, we configure non-standard ssh command options in Cylc per platform, since they may be platform specific. Platforms represent job hosts, though, rather than scheduler run hosts.

But:

That’s actually not true :tada: If you configure an ssh command under the localhost platform, that will be used to connect to scheduler run hosts [CONFIRMED … and see ssh stricthostkeychecking · Issue #512 · cylc/cylc-doc · GitHub], including - I believe - localhost [NOT CONFIRMED, apparently it is smart enough not to use ssh on localhost].

We have an issue on the cylc-doc repository to document this, but it’s not done yet, sorry: ssh stricthostkeychecking · Issue #512 · cylc/cylc-doc · GitHub

The good news is, it’s just a one-off teething problem for the Cylc admin/installer at a new site. Once you get it working it will stay working.

Currently SSH connections to “run hosts” do NOT use the localhost platform settings (including “ssh command”)

There are two connection issues contained in the tracebacks above, one is the “check if the workflow is still running on the workflow server” check:

(cylc8) login2.frontera(1172)$ cplay prototype-p8/p8.benchmark.24h.x8.y8
CylcError: Cannot determine whether workflow is running on 129.114.63.99.

The other is the host-selection psutil call:

2022-09-20T08:14:40-05:00 DEBUG - $ ssh -oBatchMode=yes -oConnectTimeout=10 129.114.63.100 env CYLC_VERSION=8.0.1 bash --login -c 'exec "$0" "$@"' cylc psutil # returned 255

Both calls are failing.

The first SSH call DOES use the localhost platform configuration, the second does not. I’m currently working on addressing this.

OK, I stand (partially) corrected!

I checked the first case, and assumed the second would use the same ssh command - that’s a bug.

I’m working on the bugfix; however, I’m not convinced that this is the source of the problem here.

“BatchMode=yes” leads to the permission denied error.

That doesn’t sound right at all! From the conversation here it looks to me like something about the SSH / network setup is very wrong. Especially as you are also having to use the address host self-identification method to get around DNS issues, I’d suggest talking to the people who set up the network if possible.
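
(For anyone following along, that refers to the global config setting, something like:

[scheduler]
    [[host self-identification]]
        method = address

which makes the scheduler identify itself by IP address rather than by host name.)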

Thanks @oliver.sanders. I’ve put in a consulting ticket to ask about what is going on with batchmode.

@hilary.j.oliver - Alas, I am the cylc admin/installer. :cry:

Hi @oliver.sanders, here is the response I got from TACC consulting:

BatchMode
If set to yes, user interaction such as password prompts and host key confirmation
requests will be disabled.

For this reason all login nodes at TACC uses the recommended default from OpenSSH
of “BatchMode=no”.

Using the “recommended default” as a default is perfectly reasonable and expected, but if they are disabling batch mode entirely then scripted or otherwise automated ssh connections between hosts are impossible.
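
A quick way to check is something like:

ssh -oBatchMode=yes -oConnectTimeout=10 <other-login-node> true && echo OK

If that fails with “Permission denied” while a plain interactive ssh to the same node works, then non-interactive (batch mode) connections are effectively blocked.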

If that’s the case, you might need to make the “business case” that that’s an unreasonable restriction - users need batch mode to get their work done.

Cylc runs distributed workflows - it has to be able to interact with job hosts via batch mode ssh.

TACC confirms that this is not just a default but batchmode is disabled. I can’t see them budging on this no matter how reasonable a case I make for it, so I think I will just have to deal with it.

Ouch. Well that certainly explains your problems above :exploding_head:

Disallowing batch mode ssh between internal nodes is really unfortunate.

Off the top of my head, Cylc needs batch mode ssh for the following:

  • To start schedulers on other hosts - if you have configured a pool of run hosts
  • To submit jobs, if the job “platform” for a task - i.e. where the job needs to be submitted to run - is not the same as the scheduler run host
  • To query and kill jobs, if the task job platform is not the same as the scheduler run host (note that interrogating PBS or Slurm is not sufficient in general, in the job query case)
  • To check if a scheduler is still running, if the run directory says it is already running on another run host - we have to check the process table on that host
  • (and also, under rsync, to install workflow files to the run directory - if the nominated platform install target is not where you run cylc install)
  • (I think that’s all…)

In principle you should still be able to use Cylc on this system if you can run schedulers on localhost (i.e. the node you are logged in on) and can submit jobs locally too (e.g. to PBS or Slurm), and only ever use cylc scan -t rich and cylc play from the same login node that the scheduler is running on. Although we might have to check whether any unnecessary localhost-to-localhost ssh is attempted.
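
(For the local job submission part, the platform definition would look roughly like this - the platform name is just illustrative:

[platforms]
    [[slurm_local]]
        hosts = localhost
        job runner = slurm
        install target = localhost

i.e. jobs are submitted from the scheduler host itself, so no ssh is involved.)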

Interestingly I can still use cylc scan and cylc stop to query workflows and to stop them, generally without issue, although I do occasionally get zombies. I am also able to submit jobs in the way you describe, so I am definitely still able to get use out of cylc. I was even able to complete @oliver.sanders test case above.

This does explain why I was never able to get this working on stampede2 though, which is also at TACC and I’m sure uses the same setup.

Ok, well that’s some good news at least!

Yes, commands to interact with a running workflow (stop, trigger, etc.) will still work from other nodes because they go via TCP.

@bencash Have you tried configuring the ssh command for the localhost platform in the global config yet?

[platforms]
    [[localhost]]
        ssh command = ssh -oConnectTimeout=10 -oStrictHostKeyChecking=no

That should prevent Cylc from using BatchMode when checking if the workflow is still running.
Does this setting fix your problem with cylc play?