Graph branching vs. suicide trigger

I’m trying to handle a situation where I don’t really need to perform recovery on a failed task; I just need to not run the downstream tasks. Here is the setup:

        PT6H = """
            @clock_A => A
            B & A => C => D
        """

where A fails because a file is intentionally missing (i.e. there are no tropical cyclones). In this case I don’t want to run C or D, and nothing else depends on these two tasks. If I do nothing, the suite stalls at the runahead limit. So I either need to set the outputs of C and D to succeeded with a recovery task, or add two suicide triggers. Unless I’m missing something, the latter seems more concise since it can be done directly in the graph.

        PT6H = """
            @clock_A => A?
            B & A? => C => D
            A:fail? => !C
            A:fail? => !D
        """

Am I missing something? What’s the downside of using a suicide trigger here, other than the validator giving a warning?

(I’ll assume you’ve accidentally replaced D => E with C => D and I’ll go with the latter version).

The problem is that C depends on both optional and required outputs:

B & A? => C => D

In Cylc 8 (unlike in Cylc 7):

  • A itself does not need removal by suicide trigger if it fails (its success is marked optional, so the scheduler removes it automatically)
  • D does not need removal by suicide trigger if A fails (it does not get spawned into the active (n=0) window in the first place unless C succeeds, so there’s nothing to remove)

But:

  • C still needs removal by suicide trigger if A fails (at least for now, see below), because it still gets spawned into the active window by B:succeeded, whereupon (once A has failed) it will be stuck waiting forever on its other prerequisite A:succeeded.

In my opinion C should be removed automatically by the scheduler in this circumstance because if (B:succeeded, A:failed) eventuates at runtime the graph says:

  • don’t run C because its prerequisites are not satisfied (even though the parent tasks ran to completion - i.e. we’re not still waiting on them)
  • that’s one of the expected outcomes because A is permitted to fail (so nothing is wrong, no reason to leave the partially satisfied task to cause a stall)

However, we haven’t agreed on this in the team yet so you’ll have to keep one suicide trigger in the graph for now.
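Putting the points above together, a minimal sketch of the reduced graph might look like the following. It reuses the task names and the `A?` / `A:fail?` notation from your post, and assumes nothing else in the workflow depends on C or D. Only the suicide trigger on C is kept, since D is never spawned unless C succeeds.

```
PT6H = """
    @clock_A => A?
    B & A? => C => D
    # C is spawned by B:succeeded, so remove it if A fails
    A:fail? => !C
"""
```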

It might be helpful to know something about how Cylc 7 and 8 compare with respect to the need for suicide triggers.

The Cylc 7 “task pool” is kind of like the Cylc 8 “active window” - they both hold objects that represent all the tasks needed to feed the scheduling algorithm given current activity.
Once a task “spawns” into the pool (or the active window) it must either run to completion or else be forcibly removed to allow the scheduler to move on.

Cylc 7 pre-spawns loads of tasks ahead of time, and if it turns out they’re not needed at run time they have to be forcibly removed.

Cylc 8 does not pre-spawn. Whenever a task completes an output the scheduler spawns only the tasks that depend on that output. I.e. tasks get spawned on the fly only if/when they are needed.

This vastly reduces the need to clean up spawned tasks that turn out not to be needed at runtime, but it does not eliminate it. A task that depends on multiple outputs is spawned by the first of them to complete and then waits in the active window for its other prerequisites, and there are edge cases (as you’ve discovered) in which such a task ends up stuck with partially satisfied prerequisites.

Thanks for the quick reply and the detailed explanation. I think I understand now: a task is spawned into the active window as soon as any one of the outputs it depends on is completed, not only once all of them are.


I’m going to suggest that where you have a known failure case it’s nicer to handle it explicitly. Given this scenario I’d suggest:

[scheduling]
  [[graph]]
    PT6H = """
      @clock_A => A
      B & A? => C => D
      # Remove C if A reports that there are no cyclones
      A:no_cyclones? => !C
    """

[runtime]
  [[A]]
    script = """
      # Whatever you were doing before plus....
      if [[ "${CYCLONES_FOUND}" == 0 ]]; then
        cylc message -- "INFO:No Cyclones"
      fi
    """
    [[[outputs]]]
      no_cyclones = "No Cyclones"

That way, if A fails for any reason other than the “No Cyclones” scenario, you don’t just keep running.

As an additional advantage, looking at the workflow logs will reveal

<timestamp> INFO: No Cyclones

Feel free to make the message more informative.
