Canonical method for in-cycle catchup

We now have several applications with the following use case, in which each cycle of the suite needs to flexibly rewind a particular graph subsection to a point several cycles back and re-run it.

  1. A data-assimilative cycling suite with dependencies on both incoming observations (divided between low and high latency) and upstream model forcing (low latency)
  2. Product delivery demands that the suite runs with wallclock delays consistent with the low latency observations and upstream model forcing.
  3. At the beginning of each cycle, go back a specified time window and cycle qc/da/forecast tasks incorporating the higher-latency observations
  4. The results of the “rewound” analyses including both high and low-latency data form the basis for the current cycle’s forecast initial condition.
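To make step 3 concrete, here is a rough Python sketch of which cycle points a single catch-up pass would revisit. The function name `catchup_points`, the 24-hour window, and the choice to include the current cycle at the end of the list are all assumptions made for illustration; they are not part of any Cylc API.

```python
from datetime import datetime, timedelta


def catchup_points(current, window, interval):
    """List the cycle points from (current - window) up to and including current.

    Hypothetical helper: only whole cycling intervals that fit inside the
    window are counted, oldest point first.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    n_back = window // interval  # whole number of cycles inside the window
    return [current - k * interval for k in range(n_back, -1, -1)]


# Example: a PT6H-cycling suite that rewinds 24 hours at each cycle.
current = datetime(2019, 1, 2, 12)
points = catchup_points(current, timedelta(hours=24), timedelta(hours=6))
for point in points:
    print(point.strftime("%Y%m%dT%H%MZ"))
```

For the 15-minute suites the same call would use `timedelta(minutes=15)` as the interval and a window of a few hours.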

This has come up both for specialized suites with cycling intervals as short as 15 minutes (where the “high latency” is only several hours) and for more traditional PT6H-cycling suites with much higher-latency (24+ hours) data.

What’s a canonical way that others have handled this? I’ve toyed with the idea of a task that registers a separate suite and runs its suite controller non-daemonized within the task, so that the sub-suite submits its own jobs, but that makes monitoring tough. I know there are some discussions about things in the works for future versions which should make this cleaner, but for now this is in the 7.x world.


Hi Tim,

That is quite a complicated requirement for dynamic structural change to the workflow. We do indeed hope to be able to do this sort of thing more elegantly in the future (Cylc 9?) but for now … in my experience (others may differ … Met Office may have requirements like this?) I have not seen real-world examples of this, and I’m not aware of any “canonical” way of doing it.

Personally I would try sub-suites (i.e. non-detaching suites running inside tasks in the top-level suite, as you’ve noted). That doesn’t necessarily make monitoring tough: register sensible sub-suite names, maybe have them all collapse into a common “group” in the gscan GUI, and house-keep carefully in the main suite to avoid proliferation of sub-suite run directories (sub-suites will typically have the top-level cycle point in their suite name). Note also the current PR intended to better support some kinds of sub-suites: https://github.com/cylc/cylc-flow/pull/2935 (and a back-port for cylc-7).
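As a minimal sketch of the naming and house-keeping idea, assuming sub-suites are registered under a common group prefix with the top-level cycle point appended: the `catchup` group name, the `/` separator, the point format, and both helper functions below are hypothetical, and the sketch only selects stale names rather than deleting anything.

```python
from datetime import datetime, timedelta

POINT_FORMAT = "%Y%m%dT%H%MZ"  # assumed cycle point format


def subsuite_name(group, cycle_point):
    """Build a sub-suite name such as 'catchup/20190102T1200Z' (hypothetical scheme)."""
    return f"{group}/{cycle_point.strftime(POINT_FORMAT)}"


def stale_subsuites(names, group, now_point, keep):
    """Return names in `group` whose embedded cycle point is older than now_point - keep."""
    prefix = group + "/"
    cutoff = now_point - keep
    stale = []
    for name in names:
        if not name.startswith(prefix):
            continue  # belongs to some other group
        try:
            point = datetime.strptime(name[len(prefix):], POINT_FORMAT)
        except ValueError:
            continue  # not one of our cycle-stamped sub-suites
        if point < cutoff:
            stale.append(name)
    return stale


now = datetime(2019, 1, 2, 12)
registered = [
    subsuite_name("catchup", datetime(2019, 1, 1, 0)),
    subsuite_name("catchup", datetime(2019, 1, 2, 0)),
    subsuite_name("catchup", now),
    "other/20180101T0000Z",
]
to_remove = stale_subsuites(registered, "catchup", now, keep=timedelta(hours=24))
print(to_remove)
```

A house-keeping task in the main suite could run something like this each cycle and then clean up the run directories of whatever it returns.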

Otherwise, to do it all in the main suite I think you’d need a good understanding of the “task pool”, and a task that uses the CLI to back-populate the pool with waiting proxies (or resets existing proxies to waiting, if they are still in the pool) for the whole sub-graph before triggering it. That might not be too difficult, but definitely the sort of thing to test in a simple dummy suite first!

Hilary

I think it depends on how complicated the graph sub-section is that you need to “rewind” (repeat). If it’s only a few tasks then you may be able to just add the necessary tasks to the current cycle to perform the catchup. This would be a lot easier to manage than spawning a separate sub-suite. I wouldn’t recommend trying to rewind the suite.

(It wasn’t entirely clear to me if the sub-graph spans multiple cycles - “flexibly rewind a particular graph subsection several cycles ago” suggests it might? - but if not then I would tentatively agree with @dpmatthews).

I’m glad to hear that there’s not something simple I was missing.

In this case, we expect that the sub-graph would indeed span multiple cycles. I agree with @dpmatthews that trying to literally rewind the suite is not recommended, but I’ll need to check with those that had the requirement to see just how detailed the repeating cycle needed to be. At its most general, it’s almost a recursive operation where each cycle point launches a suite with the same set of tasks but with an ICP of $CYLC_TASK_CYCLE_POINT-(whatever the catchup delay is) and an FCP of $CYLC_TASK_CYCLE_POINT.
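The ICP/FCP arithmetic for such a sub-suite could look roughly like this in Python. This is only a sketch: it assumes cycle points are formatted like 20190102T1200Z, handles just a small day/hour/minute subset of ISO 8601 durations, and `parse_duration` and `subsuite_bounds` are hypothetical helpers rather than Cylc functions.

```python
import os
import re
from datetime import datetime, timedelta

POINT_FORMAT = "%Y%m%dT%H%MZ"  # assumed cycle point format
# Small ISO 8601 duration subset: days, hours, minutes (e.g. P1D, PT6H, PT15M).
_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?)?$")


def parse_duration(text):
    """Convert a duration such as 'PT24H' into a timedelta."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)


def subsuite_bounds(cycle_point, catchup_delay):
    """Return (ICP, FCP) strings: ICP = cycle point - delay, FCP = cycle point."""
    fcp = datetime.strptime(cycle_point, POINT_FORMAT)
    icp = fcp - parse_duration(catchup_delay)
    return icp.strftime(POINT_FORMAT), fcp.strftime(POINT_FORMAT)


if __name__ == "__main__":
    # Inside a task job the cycle point comes from the environment;
    # the fallback value is only there so the sketch runs standalone.
    point = os.environ.get("CYLC_TASK_CYCLE_POINT", "20190102T1200Z")
    icp, fcp = subsuite_bounds(point, "PT24H")
    print(f"ICP={icp} FCP={fcp}")
```

The resulting pair would then be handed to whatever launches the sub-suite for that cycle.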

If we went the route of adding all the catch-up tasks to the current cycle, we’d need to do something to over-ride or otherwise mess with the cycle point passed to tasks, but that’s another option I’ll look at.

As an aside - philosophically, these catch-ups are different than just re-adding previous tasks to the queue - their job is to pick up late-arriving observations, so it matters if a task is an actual re-run (i.e. there was a problem with the execution) or if it’s executing as part of the catchup/rewind part (where rerunning is expected and it seems clearer to have a separate task name). That seems to suggest more of the subsuite approach. We’ll play with some things.