Restarting failed workflow

The hostname command just prints the name of the host, e.g.:

$ ssh foo hostname
foo
$ echo $?
0
$ ssh foo hostname -I
1.2.3.4
$ echo $?
0

Ah, so it should not actually have tried to connect, just print the hostname, when I ran that command? In which case the issue is just that “BatchMode=yes” leads to the permission denied error.

(ssh foo hostname should SSH to the host called “foo” and run the hostname command on that host)

I agree!

@bencash - the problem occurs when cylc tries to connect to the recorded run host to see if the scheduler is still running there.

If you have not configured any run hosts, cylc play will start schedulers on localhost. Later if you try to restart the workflow, cylc play will try to connect to the recorded run host to see if the scheduler is still running. If you do that from the original host, the resulting ssh connection will be to localhost.
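
For reference, a pool of run hosts is configured in the global config, something like this (the host names here are just placeholders):

[scheduler]
    [[run hosts]]
        available = login1, login2

With nothing set there, cylc play simply starts the scheduler on the node you ran it from.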

(I have a feeling it’s next to impossible to be sure if a target host is actually the same as localhost without going there, so Cylc probably treats all hosts equally for this purpose). [UPDATE: see below, I guess this is not true!]

If you configure ssh for non-interactive use on a cluster it should automatically work between all nodes, including localhost-to-localhost, because .ssh/ is on the shared filesystem.
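
(For example, the usual one-off setup where $HOME - and therefore ~/.ssh - is shared across nodes is roughly:

ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

after which ssh <other-node> hostname should work from any node without prompting, including back to the same node - you may still need to accept host keys the first time.)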

However, we configure non-standard ssh command options in Cylc per platform, since they may be platform specific. Platforms represent job hosts, though, rather than scheduler run hosts.

But:

That’s actually not true :tada: If you configure an ssh command under the localhost platform, that will be used to connect to scheduler run hosts [CONFIRMED … and see ssh stricthostkeychecking · Issue #512 · cylc/cylc-doc · GitHub], including - I believe - localhost [NOT CONFIRMED, apparently it is smart enough not to use ssh on localhost].

We have an issue on the cylc-doc repository to document this, but it’s not done yet, sorry: ssh stricthostkeychecking · Issue #512 · cylc/cylc-doc · GitHub

The good news is, it’s just a one-off teething problem for the Cylc admin/installer at a new site. Once you get it working it will stay working.

Currently SSH connections to “run hosts” do NOT use the localhost platform settings (including “ssh command”)

There are two connection issues contained in the tracebacks above, one is the “check if the workflow is still running on the workflow server” check:

(cylc8) login2.frontera(1172)$ cplay prototype-p8/p8.benchmark.24h.x8.y8
CylcError: Cannot determine whether workflow is running on 129.114.63.99.

The other is the host-selection psutil call:

2022-09-20T08:14:40-05:00 DEBUG - $ ssh -oBatchMode=yes -oConnectTimeout=10 129.114.63.100 env CYLC_VERSION=8.0.1 bash --login -c 'exec "$0" "$@"' cylc psutil # returned 255

Both calls are failing.

The first SSH call DOES use the localhost platform configuration, the second does not. I’m currently working on addressing this.

OK, I stand (partially) corrected!

I checked the first case, and assumed the second would use the same ssh command - that’s a bug.

I’m working on the bugfix; however, I’m not convinced that this is the source of the problem here.

“BatchMode=yes” leads to the permission denied error.

That doesn’t sound right at all! From the conversation here it looks to me like something about the SSH / network setup is very wrong. Especially as you are also having to use the address host self-identification method to get around DNS issues, I’d suggest talking to the people who set up the network if possible.
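
(For anyone following along, that refers to the global config setting, something like:

[scheduler]
    [[host self-identification]]
        method = address

which makes the scheduler identify itself by IP address rather than by host name.)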

Thanks @oliver.sanders. I’ve put in a consulting ticket to ask about what is going on with batchmode.

@hilary.j.oliver - Alas, I am the cylc admin/installer. :cry:

Hi @oliver.sanders, here is the response I got from TACC consulting:

BatchMode
If set to yes, user interaction such as password prompts and host key confirmation
requests will be disabled.

For this reason all login nodes at TACC uses the recommended default from OpenSSH
of “BatchMode=no”.

Using the “recommended default” as a default is perfectly reasonable and expected, but if they are disabling batch mode entirely then scripted or otherwise automated ssh connections between hosts are impossible.
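
A quick way to check is something like:

ssh -oBatchMode=yes -oConnectTimeout=10 <other-login-node> true && echo OK

If that fails with “Permission denied” while a plain interactive ssh to the same node works, then non-interactive (batch mode) connections are effectively blocked.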

If that’s the case, you might need to make the “business case” that that’s an unreasonable restriction - users need batch mode to get their work done.

Cylc runs distributed workflows - it has to be able to interact with job hosts via batch mode ssh.

TACC confirms that this is not just a default but batchmode is disabled. I can’t see them budging on this no matter how reasonable a case I make for it, so I think I will just have to deal with it.

Ouch. Well that certainly explains your problems above :exploding_head:

Disallowing batch mode ssh between internal nodes is really unfortunate.

Off the top of my head, Cylc needs batch mode ssh for the following:

  • To start schedulers on other hosts - if you have configured a pool of run hosts
  • To submit jobs, if the job “platform” for a task - i.e. where the job needs to be submitted to run - is not the same as the scheduler run host
  • To query and kill jobs, if the task job platform is not the same as the scheduler run host (note that interrogating PBS or Slurm is not sufficient in general, in the job query case)
  • To check if a scheduler is still running, if the run directory says it is already running on another run host - we have to check the process table on that host
  • (and also, under rsync, to install workflow files to the run directory - if the nominated platform install target is not where you run cylc install)
  • (I think that’s all…)

In principle you should still be able to use Cylc on this system if you can run schedulers on localhost (i.e. the node you are logged in on) and can submit jobs locally too (e.g. to PBS or Slurm), and only ever use cylc scan -t rich and cylc play from the same login node that the scheduler is running on. Although we might have to check whether any unnecessary localhost-to-localhost ssh is attempted.
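
(For the local job submission part, the platform definition would look roughly like this - the platform name is just illustrative:

[platforms]
    [[slurm_local]]
        hosts = localhost
        job runner = slurm
        install target = localhost

i.e. jobs are submitted from the scheduler host itself, so no ssh is involved.)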

Interestingly I can still use cylc scan and cylc stop to query workflows and to stop them, generally without issue, although I do occasionally get zombies. I am also able to submit jobs in the way you describe, so I am definitely still able to get use out of cylc. I was even able to complete @oliver.sanders test case above.

This does explain why I was never able to get this working on stampede2 though, which is also at TACC and I’m sure uses the same setup.

Ok, well that’s some good news at least!

Yes, commands to interact with a running workflow (stop, trigger, etc.) will still work from other nodes because they go via TCP.

@bencash Have you tried configuring the ssh command for the localhost platform in the global config yet?

[platforms]
    [[localhost]]
        ssh command = ssh -oConnectTimeout=10 -oStrictHostKeyChecking=no

That should prevent Cylc from using BatchMode when checking if the workflow is still running.
Does this setting fix your problem with cylc play?