How to restart from an earlier cycle?

I occasionally need to stop my workflow and start again from a previous cycle, for example to fill in gaps in a time series caused by data arriving late.

In cylc-7 I would simply stop the workflow, update the start time in the rose-suite.conf file and run again (without a --restart flag). It runs cold-start again but that’s fine, sometimes even desirable. How do I do this in cylc-8?

In cylc-8 it appears the workflow can only restart from where it left off, which is not what I want: I want to rewind it a few cycles.

Hi,

Cylc 8 workflows will always restart where they left off. To re-run all or part of a workflow at Cylc 8, use “concurrent flows”:

https://cylc.github.io/cylc-doc/stable/html/user-guide/running-workflows/reflow.html

E.g. if you want to rewind a few cycles, you might do this:

# start a new execution of the workflow starting from the selected task(s)
$ cylc trigger --flow=new <task(s)-to-start-from>

# stop the original execution of the workflow
$ cylc stop --flow=1

Where <task(s)-to-start-from> might be the first task in the cycle you want to rewind to, e.g. 2000/foo. Note that you can specify multiple tasks here if needed.
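
As a rough illustration of the two commands with the placeholder filled in, here is a small sketch that only prints the commands so you can check them before running anything. The workflow ID my_workflow and the helper name rewind_commands are made up for this example, and it assumes the original flow is flow number 1, as in the commands above.

```shell
# Hypothetical helper: print (do not run) the commands for rewinding a workflow.
# Arguments: workflow ID, then one or more cycle/task IDs to start from.
rewind_commands() {
    workflow="$1"
    shift
    targets=""
    for task in "$@"; do
        # Full task ID form: <workflow>//<cycle>/<task>
        targets="${targets} ${workflow}//${task}"
    done
    # start a new execution of the workflow from the selected task(s)
    echo "cylc trigger --flow=new${targets}"
    # stop the original execution (assumed here to be flow 1)
    echo "cylc stop --flow=1 ${workflow}"
}

# Example: rewind the made-up workflow "my_workflow" to 2000/foo
rewind_commands my_workflow 2000/foo
```

Once the printed commands look right, they can be run as they are. Passing several cycle/task IDs after the workflow ID adds them all to the trigger command.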

Note: We are currently working on an enhancement that would save you having to work out the start-of-cycle task(s). It may arrive in Cylc 8.3.0 if we manage to get it done in time.

Thanks. Should this work if the workflow happens to have stalled?

At first glance the answer is no, as I tried it on a stalled workflow and nothing happened, but maybe there is something else I need to do? Besides, I have tried in vain to unstall my workflow but it descended into a never-ending spiral of older unsatisfied prerequisites. If only I could start it fresh from an older cycle it would solve that problem easily too.

Yes, this will also work if your workflow has stalled.

I have now decided that my best course of action is just to delete the .service/db file and start the workflow again, just like I used to with cylc7. I guess the difference is that it used to clobber the db on a new start whereas now it doesn’t so you have to do it manually.

I am, though, still interested in whether the concurrent flow would be expected to work after a stall, or whether you have to clear the stall first. Thanks.

Ok, it should work, that’s good to know.

For the record, Cylc 8 does not allow a new cold-start in an existing run directory, for provenance and safety reasons. If you accidentally get the new start date wrong, or if you really meant to do a restart (cold start is unfortunately the default in Cylc 7), then you will overwrite existing data and wipe out the true current state of your workflow. And even if you did intend that, and got the start date right, there will be a period where the run directory contains an unholy mix of old and newly-generated data.

Cylc 8 is actually far more flexible in this regard, although we haven’t quite finished off the user interface to make it as easy as possible.

  • you can restart from the existing state
  • or you can cold-start again from a previous cycle, in a new run directory
  • (or you can, if you really want to, overwrite the old one; just deliberately remove its .service directory first)
  • you can “rewind” the existing workflow at runtime, by starting a new flow - not just at a previous cycle point, but anywhere in the graph. The new flow can coexist with, or replace, the original flow. This is the really powerful new feature, but if possible you should wait for the user interface changes and documentation.
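
A minimal sketch of the second option above (cold-start from a previous cycle in a new run directory), written as a helper that only prints the commands for review. The workflow name my_workflow, the start point 2000 and the helper name cold_start_commands are made up for illustration, and it assumes you run the install command from the workflow source directory and that your workflow accepts a start cycle point on the command line rather than from its own configuration.

```shell
# Hypothetical helper: print (do not run) the commands for a fresh
# cold-start from an earlier cycle point in a new run directory.
# Arguments: workflow name, start cycle point.
cold_start_commands() {
    workflow="$1"
    start_point="$2"
    # Install the source again; this is expected to create a new
    # numbered run directory alongside the existing one.
    echo "cylc install --workflow-name=${workflow}"
    # Start the newly installed run from the chosen cycle point.
    echo "cylc play --start-cycle-point=${start_point} ${workflow}"
}

# Example with made-up values
cold_start_commands my_workflow 2000
```

The old run directory is left untouched, so the previous state and outputs remain available for comparison.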

A stall is caused by an incomplete task - i.e. one that did not complete its required outputs. To unstall it you need to either complete the task (e.g. by re-triggering it after fixing it) or remove it (i.e. tell the scheduler to forget it, even though it is incomplete).
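
To make the two options concrete, here is a small sketch that prints the command for each, so you can check it before running it. The task ID my_workflow//2000/foo and the helper name unstall_command are made up for illustration; substitute whichever incomplete task the scheduler log reports.

```shell
# Hypothetical helper: print (do not run) the command for dealing with
# an incomplete task that is stalling the workflow.
# Arguments: action ("rerun" or "remove"), full task ID.
unstall_command() {
    action="$1"
    task_id="$2"
    case "$action" in
        # re-trigger the task once the underlying problem is fixed
        rerun)  echo "cylc trigger ${task_id}" ;;
        # tell the scheduler to forget the task, although it is incomplete
        remove) echo "cylc remove ${task_id}" ;;
        *)      echo "unknown action: ${action}" >&2; return 1 ;;
    esac
}

# Example with a made-up task ID
unstall_command rerun my_workflow//2000/foo
```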

I’m not sure what you mean by “a never-ending spiral of older unsatisfied prerequisites”. If you want to post the form of your graph and exactly what commands you used, I’m sure we could figure out what happened there.

Note that if triggering new flows you need to be aware that prerequisite satisfaction is self-consistent within a flow. The prerequisites of a flow=2 task, for instance, cannot be satisfied by the outputs of a flow=1 task (if they could, every task in flow=2 started further back in the graph would start running immediately, because flow=1 had already passed through and satisfied all the prerequisites).

I was struggling to understand it myself! I don’t have a graph to show you unfortunately because I clobbered the database as described and it is now happily chugging away. I don’t know why it stalled in the first place but the sequence went something like this…

Over the weekend, while I wasn’t watching, it shut down unexpectedly. No big deal: I fired it up again on Monday, but it soon got into a stalled state. The scheduler log showed a lot of waiting tasks with unsatisfied prerequisites, which I know had already run because I have the outputs. Sifting through the massive pile of tasks, I found what I thought was the oldest unsatisfied dependency and set it to completed. This then caused other tasks (which had already run) to enter the fray and claim they were waiting for other prior tasks to complete, so I set them to completed, and so on. Basically a recursive stall.

I assumed I wouldn’t have to go back far but I never reached a point where it wouldn’t draw in older tasks. After a few hours of firing off cylc commands to try and rescue it without brute force I gave up and decided to cold-start.

A new cold-start will not be a good solution for a lot of workflows, but this one has been designed to work from any realistic cycle point, so it’s actually a solution of first resort; I just thought I’d try and sort it out ‘mid-flow’ this time. Perhaps I did something wrong, but it’s not obvious what.

Thanks for your comments