Restarting failed workflow

I have a situation where a task in my workflow failed because a file was not staged in the right location. I moved the file, and would now like to restart the workflow from where it ran into the problem. My understanding is that I should just be able to run cylc play, but when I do that, as far as I can tell, nothing happens. Am I missing a step? This is using Cylc 8 rc3.

Yes, cylc play should always pick up where the previous run left off.

By failed I guess you mean that some of the tasks within the workflow failed, rather than the Cylc scheduler itself failing?

If the workflow is shutting down immediately after starting up, it usually means that the workflow has already run to completion. Try inspecting the workflow log file; if there were any internal issues they will be reported there.
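
For example (the workflow ID "my_workflow" here is just a placeholder), you can print the scheduler log with:

# print the scheduler log for the workflow
cylc cat-log my_workflow

# or read it directly from disk
less ~/cylc-run/my_workflow/log/scheduler/log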

If your workflow has already run to completion but you would like to “rewind” it back a bit and rerun try the following:

# restart the workflow (the --pause will stop it shutting down straight away)
cylc play --pause workflow

# trigger a new "flow" with the first task you wish to rerun
# for more info see https://cylc.github.io/cylc-doc/latest/html/user-guide/running-workflows/reflow.html
cylc trigger --flow=new workflow//cycle/task

# unpause the workflow
cylc play workflow

New “flows” allow you to have multiple “logical runs” of the same workflow running under a single scheduler. They are useful for situations where you need to re-run part of a workflow e.g. due to data issues. They are a brand-new feature of Cylc 8.

https://cylc.github.io/cylc-doc/latest/html/user-guide/running-workflows/reflow.html

It definitely hadn’t run to completion, because one of the intermediate tasks exited with an error and no other tasks ran after that. I’ll take a look at the logs and see if I can figure out what was going on with the restart attempt.

When you restart a workflow it continues in the same way it would have if you never shut it down. So, if a task was in the failed state before shutdown it will remain in the failed state after the restart. To rerun the failed task use cylc trigger on that task.
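
For example (workflow and task names here are placeholders):

# rerun the failed task at cycle point 1
cylc trigger my_workflow//1/my_failed_task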

Hey everyone - a blast from the past! I got pulled away from this problem for a while but am now back messing with it and still struggling. It seems like, no matter which combination of steps I take, if a workflow fails my only recourse is to create a brand new workflow. Does someone have a really basic “hello world” example of restarting a failed workflow that I could try?

Here’s a trivial flow.cylc:

[scheduling]
    [[graph]]
        R1 = hello
[runtime]
    [[hello]]  # fail the first time
        script = "((CYLC_TASK_SUBMIT_NUMBER > 1)) && echo Hello"

If you use cylc play workflow to run this it will stall (the scheduler remains running with task “hello” in the failed state). If you then trigger the failed task (cylc trigger workflow//1/hello) the task will run again, succeed, and the scheduler will shut down.
It makes no difference if you stop the workflow (cylc stop workflow) after the task has failed and then restart it (cylc play workflow) - the workflow just starts up and stalls with task “hello” in the failed state.
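
Put together, the full sequence looks something like this (I'm using <workflow-id> as a stand-in for whatever you installed the example as):

# first run: "hello" fails on submit number 1 and the workflow stalls
cylc play <workflow-id>

# optionally stop and restart - the failed state is preserved across the restart
cylc stop <workflow-id>
cylc play <workflow-id>

# re-trigger the failed task; it succeeds on the second submit and the workflow completes
cylc trigger <workflow-id>//1/hello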

Thanks! Dumb question time - does the //1/ in this case indicate that we want to restart from the first attempt to run hello?

I was able to get the test case to work, but I am running into an error with my production suite. In this case I had to cancel a job I submitted because I had the run length wrong. I now just want to restart the flow and rerun that step. When I go through the steps listed here, I get errors like:

(cylc8) login2.frontera(1171)$ cylc trigger --flow=new prototype-p8/p8.benchmark.24h.x8.y8//1/run_model_ldate2012010100_mem001
Done
(cylc8) login2.frontera(1172)$ cplay prototype-p8/p8.benchmark.24h.x8.y8
CylcError: Cannot determine whether workflow is running on 129.114.63.99.
/home1/02441/bcash/.conda/envs/cylc8/bin/python /home1/02441/bcash/.conda/envs/cylc8/bin/cylc play --pause prototype-p8/p8.benchmark.24h.x8.y8

It’s the cycle point.
See https://cylc.github.io/cylc-doc/stable/html/7-to-8/major-changes/cli.html#cylc-8-standardised-ids
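
In Cylc 8 the universal ID format is workflow//cycle/task (with an optional /job on the end). For a non-cycled workflow the cycle point is simply 1, e.g. (names illustrative):

cylc trigger my_workflow//1/hello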

Thanks. My workflow isn’t cycled (for the most part) so I don’t tend to think of things in terms of cycle points. Since I don’t have any cycling information specified it defaults to 1?

Since I don’t have any cycling information specified it defaults to 1?

Yes.


In this case I had to cancel a job I submitted because I had the run length wrong. I now just want to restart the flow and rerun that step.

You do not need to restart the workflow to re-run a task.

When tasks fail in Cylc the workflow remains running (note you can list running workflows with cylc scan or watch them live in the GUI). You can tell Cylc to re-run a task by running cylc trigger:

# "cycle_point" is "1" for your example
$ cylc trigger <workflow_id>//<cycle_point>/<task_name>

(Note: you can also do this from the context menu opened by clicking on the task in the GUI)


(cylc8) login2.frontera(1172)$ cplay prototype-p8/p8.benchmark.24h.x8.y8
CylcError: Cannot determine whether workflow is running on 129.114.63.99.

I’m guessing cplay is cylc play?

The error message you are seeing here is an installation / system issue most likely caused by an SSH failure.

Cylc can’t tell whether the workflow is running or not because it cannot SSH to 129.114.63.99 from login2.frontera.

Sorry, yes, cplay is “cylc play”. I’ve aliased a bunch of the commands that get used over and over like that.

Frontera is a pretty standard hpc setup with login and compute nodes. Is there something I need to change in my global.cylc or elsewhere to make this work?

Does ssh 129.114.63.99 work?

It did, but first I had to answer a fingerprint prompt. Maybe that was causing it to hang?

For automated ssh access to other hosts, by Cylc or otherwise, you need to configure ssh to use public/private keys, with no interactive passphrase entry.
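
For example, on a system where the hosts share your home directory, a minimal key setup might look like the following (a sketch, not site-specific advice; the key type and path are just one common choice):

# generate a key with no passphrase
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519

# authorise it via the shared home directory
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys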

Even if you do that, by default ssh presents an interactive prompt the very first time you connect to a host (I suspect this is what you mean?).

Options are:

  • initiate a connection manually the first time, then Cylc is good to go
  • tell ssh not to do the initial “strict host key check”: e.g. ssh -o stricthostkeychecking=no

(Note you can configure the form of the ssh command used by Cylc, in global.cylc)
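
For example, a global.cylc snippet along these lines (the platform name is just an example) sets the ssh command used for job hosts on that platform:

[platforms]
    [[my_hpc]]
        ssh command = ssh -oBatchMode=yes -oConnectTimeout=10 -oStrictHostKeyChecking=no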

We have a pending issue to document this for users.

Back again - I am finding that even when I have done an initial manual connection to a host and can subsequently connect without a prompt, I am getting errors like the following:
(cylc8) login3.frontera(1180)$ cylc play prototype-p8/restart_test.3
CylcError: Cannot determine whether workflow is running on 129.114.63.100.
/home1/02441/bcash/.conda/envs/cylc8/bin/python /home1/02441/bcash/.conda/envs/cylc8/bin/cylc play prototype-p8/restart_test.3

I have also turned off strict host checking in my global.cylc. Is there another possible explanation for this issue?

A few questions, to help debug this:

  • have you checked that your global config is being picked up? (e.g. run cylc config and check the output)
  • when you get the error message above, are you saying that you can successfully ssh to the reported host (129.114.63.100), and without a password prompt?
  • have you configured scheduler “run hosts” or is 129.114.63.100 just the IP of the host you were logged on to when you started the previous run of the workflow (that you are now trying to restart)?
  • also try cylc play --debug --no-detach to see if you get any additional information

The only time I’ve seen this error myself recently was this: some external users get onto our HPC via a JupyterHub-launched interactive Slurm session. The session gets allocated to some available compute node, and their Cylc schedulers run locally (i.e. on the same node) if they don’t configure specific scheduler run hosts. If a scheduler is still running when the Slurm session ends, it gets killed by Slurm. Then, if the user logs back in and tries to restart the workflow, the scheduler “contact file” records that it is/was running on the former interactive node, so Cylc tries to ssh there to see if it is still running or not. But the system does not let users ssh directly to these nodes if they have no processes running there … hence the above error from cylc play.

… however, I suppose that can’t be your problem, if you say you can ssh to the host in the error message?

  • Yes, global config is being picked up
  • Yes, I can ssh to that host without a password prompt
  • I have not specified anything in “run hosts”, so this is just whatever I got.
  • cylc play --debug --no-detach came back with some very interesting info! This is what it returned.

(cylc8) login3.frontera(1260)$ cstop prototype-p8/x12y12.med300.reduced.cpn48
CylcError: Cannot determine whether workflow is running on 129.114.63.100.
/home1/02441/bcash/.conda/envs/cylc8/bin/python /home1/02441/bcash/.conda/envs/cylc8/bin/cylc play prototype-p8/x12y12.med300.cpn48
(cylc8) login3.frontera(1261)$ cylc play --debug --no-detach prototype-p8/x12y12.med300.reduced.cpn48
2022-09-20T08:14:39-05:00 DEBUG - Loading site/user config files
2022-09-20T08:14:39-05:00 DEBUG - Reading file /home1/02441/bcash/.cylc/flow/8.0.1/global.cylc
2022-09-20T08:14:40-05:00 DEBUG - $ ssh -oBatchMode=yes -oConnectTimeout=10 129.114.63.100 env CYLC_VERSION=8.0.1 bash --login -c 'exec "$0" "$@"' cylc psutil # returned 255
To access the system:
1) If not using ssh-keys, please enter your TACC password at the password prompt
2) At the TACC Token prompt, enter your 6-digit code followed by <return>.
bcash@129.114.63.100: Permission denied (keyboard-interactive).

However, ssh bcash@129.114.63.100 goes through just fine if I enter it manually from the command line. One strange thing from that debug info is that I set stricthostkeychecking=no in my global.cylc but it doesn’t seem to have been picked up here:

(cylc8) login3.frontera(1267)$ cylc config
[scheduler]
    [[host self-identification]]
        method = address
[platforms]
    [[cheyenne]]
        job runner = pbs
        install target = localhost
        hosts = localhost
    [[expanse]]
        job runner = slurm
        install target = localhost
        hosts = localhost
    [[stampede2]]
        job runner = slurm
        install target = localhost
        hosts = localhost
    [[frontera]]
        job runner = slurm
        install target = localhost
        hosts = localhost
        ssh command = ssh -oBatchMode=yes -oConnectTimeout=10 -ostricthostkeychecking=no

Do I have that specified incorrectly somehow?

Hi there,

Reported Traceback

Permission denied (keyboard-interactive).

I think this confirms that SSH required an interactive prompt, so it has not been configured to work non-interactively.

Unfortunately this is very hard for us to debug from our end as we don’t have access to the system or understanding of its configuration. It’s likely that this can be resolved in your SSH config file, but sadly I can’t tell you how from here.
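
That said, as a rough sketch (the host pattern and key path are assumptions about your setup, not known values), an entry in ~/.ssh/config along these lines is the sort of thing that sometimes helps by forcing non-interactive key authentication for those hosts:

Host 129.114.63.*
    IdentityFile ~/.ssh/id_ed25519
    PreferredAuthentications publickey
    StrictHostKeyChecking no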

SSH

However, ssh bcash@129.114.63.100 goes through just fine

Could you try running this command (exactly as written, don’t add bcash@ to the start of the host IP):

ssh -oBatchMode=yes -oConnectTimeout=10 129.114.63.100 hostname

Then try with -oStrictHostKeyChecking turned off:

ssh -oStrictHostKeyChecking=no -oBatchMode=yes -oConnectTimeout=10 129.114.63.100 hostname
  • If only the second command works then we need to work on adding -oStrictHostKeyChecking to the SSH command.
  • If neither work you’ll likely need to do some work to your SSH configuration.
  • If both work I’m confused.

The really strange thing is that this SSH should only happen if:

  • More than one host has been configured by [scheduler][run hosts]available.
  • OR if host ranking has been configured by [scheduler][run hosts]ranking.

So it shouldn’t happen at all for you if your [scheduler][run hosts] section is blank.
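
For reference (host names here are purely illustrative), configured run hosts would look like this in global.cylc, so if you don't have a section like this the SSH shouldn't be needed at all:

[scheduler]
    [[run hosts]]
        available = login1.example.com, login2.example.com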

StrictHostKeyChecking

One strange thing from that debug info is that I set stricthostkeychecking=no in my global.cylc but it doesn’t seem to have been picked up here:

It looks like you have configured the ssh command for the “frontera” platform; however, I don’t think the SSH being performed is to the frontera platform. I think it is actually an SSH to localhost - can you confirm this?

It is not presently possible to configure the ssh command used for the “run hosts” in this way (only for the platforms where jobs are submitted); I’ll try to prioritise this.

Hi @oliver.sanders - Neither worked, so I did some poking around, and the options that seem to be causing the failure are -oBatchMode=yes and including “hostname” at the end. Setting BatchMode to no, or leaving it out, avoids the “permission denied” error. Including “hostname” just causes the ssh to fail with no message.