Suite stuck after fixing graph issue

I have a workflow (full DA cycling workflow) that got stuck because I had missed a connection in my graph. After fixing it, I restarted, but got an error.

2023-05-15T01:21:43Z ERROR - ('20230314T0600Z', 'sy_um_fcst_ss_06', 'succeeded')
    Traceback (most recent call last):
      File "/g/data/dp9/cst565/conda_envs/productive/lib/python3.9/site-packages/cylc/flow/scheduler.py", line 687, in start
        await self.configure()
      File "/g/data/dp9/cst565/conda_envs/productive/lib/python3.9/site-packages/cylc/flow/scheduler.py", line 476, in configure
        self._load_pool_from_db()
      File "/g/data/dp9/cst565/conda_envs/productive/lib/python3.9/site-packages/cylc/flow/scheduler.py", line 737, in _load_pool_from_db
        self.workflow_db_mgr.pri_dao.select_task_pool_for_restart(
      File "/g/data/dp9/cst565/conda_envs/productive/lib/python3.9/site-packages/cylc/flow/rundb.py", line 873, in select_task_pool_for_restart
        callback(row_idx, list(row))
      File "/g/data/dp9/cst565/conda_envs/productive/lib/python3.9/site-packages/cylc/flow/task_pool.py", line 502, in load_db_task_pool_for_restart
        itask_prereq.satisfied[key] = sat[key]
    KeyError: ('20230314T0600Z', 'sy_um_fcst_ss_06', 'succeeded')
2023-05-15T01:21:43Z CRITICAL - Workflow shutting down - ('20230314T0600Z', 'sy_um_fcst_ss_06', 'succeeded')

After some finagling, I managed to get the suite started again, long enough to trigger tasks. But my suite refused to recognise that dependencies were met and continue the current flow. I had to trigger the next cycle entirely by hand, much like in Continuing a run that completed - Cylc / Cylc 8 Migration - Cylc Workflow Engine, except that in this case my run was nowhere near completed.
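
For concreteness, the kind of manual triggering I mean was roughly this (the workflow and task names here are just placeholders, not the real ones from my suite):

$ # trigger the first task of the next cycle by hand, in the current flow
$ cylc trigger my_da_suite//20230314T1200Z/setup_6
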
Q1. Is there any good way to fix/work around a key error like this?

I tried reproducing in a simple suite, but it did not give an error. However, it also refused to recognise the fulfilled prerequisites and continue on. I’m not sure if this is a bug, or a feature, or if it has been fixed since 8.1.2.

Here is the example.
Original graph with mistake:

[scheduling]
    initial cycle point = 20230101T06
    final cycle point = 20230103T00
    initial cycle point constraints = T00, T06, T12, T18
    runahead limit = P3
    [[graph]]
        R1 = install => setup_6
        PT1H = """
            housekeep[-PT1H] => task_6?
            setup_6 => task_6? => housekeep
            task_6:fail? => task_6_ss
        """

If I force task_6 (e.g. a forecast task) to fail, task_6_ss (e.g. a short-step forecast) runs successfully, but housekeep doesn’t run because it has unmet prerequisites.
If I fix the graph

        PT1H = """
            housekeep[-PT1H] => task_6?
            setup_6 => task_6?
            task_6:fail? => task_6_ss
           task_6? | task_6_ss => housekeep
        """

and reload/restart, housekeep should now be allowed to run. However, housekeep does not run and does not appear as an active task, so no further tasks run.
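
(By "reload/restart" I mean something along these lines; the workflow ID is a placeholder:)

$ # pick up the edited graph in the running scheduler
$ cylc reload my_test_workflow

$ # or, stop and restart the scheduler
$ cylc stop my_test_workflow
$ cylc play my_test_workflow
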
Q2. Why does housekeep not recognise that it is now able to run, or why does it lose its status as an active task after changing the graph?

Hi,

This issue was fixed in Cylc 8.1.3.

Cheers,
Oliver

Hi @srennie

I wonder if you’ve run into a more subtle problem too - in addition to the (fixed) outright bug flagged by @oliver.sanders.

Changing the graph structure in a way that affects current active tasks is risky, because it might not be easy to figure out what should happen automatically. We are trying to make this sort of thing work wherever the correct response is clear, and I’m sure there’s more to do in that regard, but in general you should expect to have to do some manual triggering to get things going again.

I’ll attempt to reproduce your example and figure out what’s happening, later on today if I can find the time.

(In case it’s not obvious by now, that’s a bug and the way to work around it is to upgrade Cylc; and until then don’t change the dependencies of a task already in the active window of the workflow).

OK, I ran your example, which does indeed illustrate why changing (and reloading) the graph structure in or near the active window is tricky.

My full flow.cylc:

[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 20230101T06
    final cycle point = 20230103T00
    runahead limit = P0  # restrict activity to one cycle point
    [[graph]]
        R1 = install => setup_6
        PT1H = """
            housekeep[-PT1H] => task_6?
            setup_6 => task_6?
            task_6:fail? => task_6_ss
            task_6? => housekeep  # OOPS
            # task_6? | task_6_ss => housekeep  # FIX
        """
[runtime]
    [[task_6]]
        script = false  # make task_6 fail
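
(If you want to try it, I ran the example roughly like this, assuming the flow.cylc above is saved in a source directory called stall-example:)

$ cylc validate ./stall-example
$ cylc install ./stall-example
$ cylc play stall-example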

When I run this, the workflow stalls after a short while with this warning:

WARNING - Partially satisfied prerequisites:
      * 20230101T0700Z/task_6 is waiting on ['20230101T0600Z/housekeep:succeeded']
CRITICAL - Workflow stalled
WARNING - PT1H stall timer starts NOW

I.e., the scheduler has nothing left to run (according to the graph, given what transpired at run time) but there must be something wrong because the prerequisite tracking system has one partially satisfied task: T07/task_6 is waiting on T06/housekeep:succeeded.
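
(As an aside, you can inspect exactly which prerequisites are and aren’t satisfied for a task like this with cylc show, e.g. using the placeholder run name from above:)

$ cylc show stall-example//20230101T0700Z/task_6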

The following graph illustrates the state of the workflow at this point:

The fuzzy dots indicate what happened in the past (grey: succeeded, red: failed but not a cause for concern as success was marked optional for task_6).

The hard blue dots indicate that T08/setup_6 was spawned as ready to go but held back by the runahead limit because of the partially satisfied task T07/task_6 (satisfied prerequisite marked small blue dot).

What you should see in the UI at this point is T08/setup_6 and T08/task_6, which is one graph edge from it. That’s the n=1 window around the “active” tasks. The fact that T06/housekeep shows up too is probably a small bug in our n-window computation (it should have been pruned from n=1 already when T06/task_6 finished), but that doesn’t really matter. In Cylc 8, tasks surrounding the active ones are visible if the graph window is big enough, and not visible if not. That’s merely visualization (we can’t show all the tasks at once in an infinite workflow). You can trigger tasks, view logs, etc., anywhere in the graph regardless of that.

Note that T06/housekeep is essentially on a graph path that, in the (albeit recent) past was not taken, because events did not take us in that direction. However, Cylc can’t guess that you really wanted housekeep to run even if the parent task failed.

(setup_6 can run in every cycle - subject to runahead limiting - because it has no prerequisites).

Now, if I change the dependencies of housekeep to those marked # FIX above, then as far as the T06 task instance is concerned all I am doing is changing the prerequisites of a past task on a path that was not taken in the graph. You will definitely want future instances of the task to respect the new dependencies, but I don’t think you can expect Cylc to second-guess what already happened in the past under the old graph config (what if it was further back in the past, and how far back should we go to check whether it would affect historical tasks?).

The good news is, with your God-like knowledge of how you want the workflow to carry on once you’ve changed the graph, all you need to do is manually trigger the housekeep task. [Tested - it works :tada: ] Again, if that task is not appearing in the UI, that’s just the visualization window size. Use cylc trigger at the command line instead (we haven’t yet made the window size update instantly in the UI if you expand it, but that’s coming).
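
For the example above, that would be something like (using whatever name you installed the workflow under):

$ cylc trigger stall-example//20230101T0600Z/housekeep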

Note I’ve gone to great lengths above to try to explain exactly what you’re seeing. But in practice you don’t really need to understand that. Long story short:

If you change and reload the graph structure

  • the new structure will apply to all future activity
  • (except current submitted and running tasks, which were already active with the old settings)
  • if the reload doesn’t fix a stall (for reasons like the above) just manually trigger the stuck task(s) to carry on under the new graph structure
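
Putting that together, a typical recovery sequence for a case like this would be (workflow ID and task are placeholders):

$ # after editing the source flow.cylc: reinstall and reload the running workflow
$ cylc reinstall my_workflow
$ cylc reload my_workflow

$ # then re-trigger any task(s) still stuck under the old graph
$ cylc trigger my_workflow//20230101T0600Z/housekeep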

Thanks for checking this out thoroughly, @hilary.j.oliver.
I will be glad of my omnipotence, as it seems that in a larger suite I had to trigger a lot more than just housekeep to get things rolling again. And I also had to remember to specify the right flow! :smiley:
Hopefully I will be using a later cylc version in future trials - I don’t think I can upgrade mid-trial and I’m not going to try.

For the case above, there is no need to specify a flow number, because the default for cylc trigger is to assign the triggered task to the current flow (usually flow 1), which is exactly what I think you’d want here.

Triggering more than just housekeep should only have been necessary if you had multiple tasks in the same situation as the one above (partially satisfied prerequisites).
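
(For reference, the flow is set with the --flow option of cylc trigger; leaving it off gives the default behaviour described above:)

$ # default: run the task in the current flow
$ cylc trigger my_workflow//20230101T0600Z/housekeep

$ # run it outside of any flow - the workflow will not flow on from it
$ cylc trigger --flow=none my_workflow//20230101T0600Z/housekeep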

From your comment on flows above, I wonder if you triggered tasks in other flows initially? If so, in this sort of situation you might have had to deal with partially satisfied tasks in several flows at once.

However, perhaps it was a result of the actual (now-fixed!) bug that caused the traceback above.

Up to you, but the difference between 8.1.0 and 8.1.4 is only a whole bunch of bug fixes - so I’d do it!

I’ve definitely ended up with a few flow=none tasks where I forgot to specify the flow, and the suite had stalled. They were in the next cycle past the runahead limit, which should have been ready to go once housekeep had run and that cycle completed (as well as the next N cycles). Hmm, maybe that is the difference: housekeep is not actually a prerequisite for the next cycle in my NWP suite, but it does hold things up via the runahead limit.

I don’t think I can upgrade mid-trial and I’m not going to try.

Upgrading between Cylc versions in the same series (e.g. from 8.1.2 to 8.1.4) is completely safe. I would strongly recommend doing this.

# with the new version of Cylc
$ cylc stop <id>
$ cylc play <id>

A future version of the GUI will include a button to automatically restart workflows with more recent versions of Cylc to make this easier.
