How to allow next cycle to progress when current cycle has failed tasks

Hi Cylc team,

I’m running operational forecasting workflows and I’m looking for the right pattern for this behavior:

What I want:

  1. Keep strict within-cycle dependency logic.
  2. If an upstream task fails, downstream tasks that require it should not run.
  3. Failed/unmet tasks in cycle N should not block scheduling/progression to cycle N+1.
  4. I want to avoid manual intervention (manually removing blocked tasks).

Current behavior:

  • I already set:
    • abort on stall timeout = False
  • This prevents workflow abort, but the workflow still stalls/blocks on the same cycle due to unmet required outputs in the graph.

What I tried:

  1. “completion = succeeded or failed” at [runtime][root] level.
  • Validation fails because some tasks are required as :succeeded in graph outputs.
  • Error example:
    • “:succeeded is required in the graph, but optional in the completion expression succeeded or failed”
  2. Optional outputs (?) in graph.
  • I understand:
    • A => B? means B still requires A:succeeded to run, but B is optional for workflow completeness.
    • A? => B would relax dependency (not what I want).
  • This helps in some places, but in larger chained graphs it becomes complicated to apply safely without making graph definitions messy.

Core question:
Is there a global or cleaner Cylc-native pattern to treat a cycle as “complete with failures” (without stalling) for progression purposes, while still preserving strict within-cycle success dependencies (i.e., no downstream execution when hard upstream failed)? I.e., in a forecast workflow scenario, I want to allow my next cycle to carry on even if there are failed tasks in the current cycle.

If the recommended approach is graph-based only, is there a best-practice pattern for large workflows to avoid lots of repetitive optional/suicide logic?

Thanks in advance.

I’m not sure what you mean by that - you don’t want or need any inter-cycle dependencies?

It’s certainly always true that a downstream task will not run if it depends on the success of an upstream task that failed.

A failed task can recede into the past just like a succeeded one, so long as the workflow is written to handle the failure automatically. Optional success (foo?), and hence optional failure, means the failure is handled either explicitly (foo:failed? => bar) or implicitly (it’s OK for the graph to simply dead-end there on failure, i.e. the products of that task are not critical to the purpose of the workflow).

Otherwise, success of the task is required by the workflow and its failure is by definition an event that the workflow does not handle. As such, Cylc retains the “incomplete” task in the active window pending manual intervention to tell it what to do.

Even an incomplete failed task does not immediately prevent progression though, unless everything else depends (directly or indirectly) on its success, in which case there is nothing else the scheduler can do. Otherwise the workflow will eventually stall at the runahead limit, r cycles ahead of the incomplete task, if the runahead limit is set to r cycles.

So it sounds like you are getting a runahead stall due to getting too far ahead of some incomplete failed tasks that require manual intervention?

If the success of these tasks is not really required, just mark them as optional - problem solved.

If their success is required, but you have not told Cylc how to handle their failure, then Cylc has no choice but to await manual intervention. But you could up the runahead limit to allow more of them to accumulate before the workflow stalls - would that be sufficient for your needs?

Note that in a potentially infinite cycling workflow Cylc can’t just accumulate incomplete tasks indefinitely, until memory is exhausted - that’s more or less what the runahead limit is for.
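
As a minimal sketch of raising the limit in flow.cylc (the value P5 is purely illustrative, not a recommendation; the right number depends on how many incomplete cycles you can tolerate before someone intervenes):

```cylc
[scheduling]
    # Illustrative value only. A larger limit lets more cycles become active
    # ahead of the oldest incomplete task before the workflow stalls.
    runahead limit = P5
```

This only postpones the stall. Incomplete tasks whose success is required still need intervention eventually.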

No, it doesn’t relax the dependency - B will only run if A succeeds. It means success of A is not required so the workflow won’t stall if A fails. See Flaky Pipelines.

@Rafael_Soutelino - if you already read my response above, please revisit it - I rewrote it for clarity and brevity (I hope).

That would be bad - it tells Cylc that it’s OK for all tasks to fail. In the worst case scenario, they could all fail and the workflow would shut down as successfully run to completion without doing anything useful!

Yes, if you said that :succeeded is optional for all tasks then for consistency those outputs must be marked optional everywhere they appear in the graph (specifically, on the left side of trigger expressions: upstream-task:output? => downstream-task).

That is short for A:succeeded? => B which as @dpmatthews noted means precisely (1) B depends on success of A, and (2) success of A is optional - nothing more.

Yes, but it is perhaps more clearly written like this (because tasks trigger off of upstream outputs):

A => B  # B triggers off of A:succeeded, and A:succeeded is required
B?  # B:succeeded is optional
# or if something else triggers off of B
B? => C

Given what I said above in my first response, the proper way to do this is to use optional outputs properly. Or, if the success of these tasks really is necessary to the workflow but you can’t intervene frequently (??), extend the runahead limit to allow the workflow to progress further before stalling.

I don’t see how Cylc could do more than that, given that you have (evidently) not told it how to handle the failures automatically.

Thanks @hilary.j.oliver and @dpmatthews for the thorough answers.

Yes, what I really want to avoid is frequent intervention, but I didn’t have the right idea around Cylc expecting failures to be explicitly handled. I’ll paste a snippet of my workflow to make things more clear.

[scheduler]
    UTC mode = True
    # allow implicit tasks = True
    [[events]]
        stall timeout = PT3H  # Longer than retry delays (2h max), catches genuine stalls
        inactivity timeout = P1D  # the longest expected gap (@delay24h)
        abort on stall timeout = False # Logs the stall warning but keeps running
                                       # Alerts to problems without killing the workflow
                                       # Next cycle triggers even if current is stalled??
        # Visibility into stuck tasks while maintaining operational continuity

[scheduling]
    runahead limit = P1
    initial cycle point = previous(T00:00)

    [[xtriggers]]
        delay4h = wall_clock(offset=PT4H)
        delay6h = wall_clock(offset=PT6H)
        delay12h = wall_clock(offset=PT12H)
        cmems_hourly_done = workflow_state("OP/test/cmems//%(point)s/transform_hourly:succeeded"):PT10M
        cmems_merged_done = workflow_state("OP/test/cmems//%(point)s/transform_merged:succeeded"):PT10M
    [[graph]]
        PT24H = """    
            # ROMS
            @cmems_hourly_done & @cmems_merged_done => roms_sebrazil-spinup & roms_nz-no-tide-spinup & roms_nz-spinup & roms_saus2d & roms_gulf-spinup & roms_taiwan-spinup & roms_bangladesh
            roms_nz-no-tide-spinup => roms_nz-no-tide
            roms_nz-spinup => roms_nz => roms_bop-spinup => roms_bop
            # SCHISM
            @delay12h => transfer_rivers
            transfer_rivers & roms_nz => schism_gtbay
        """

The way I’ve been handling failures is by adding scheduler and task events in global.cylc, as below:

[platforms]
    [[kapua]]
        hosts = localhost
        job runner = slurm
        install target = localhost
        retrieve job logs = True


[scheduler]
    [[run hosts]]
        available = kap-xeon05, kap-xeon06, kap-xeon07, kap-xeon10
        ranking = """
            getloadavg()[2] < 10

            virtual_memory().available
        """
        # ssh command = ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no

    [[events]]
        stall handlers = jira_alert_workflow.sh %(workflow)s %(event)s %(host)s

[task events]
    handler events = failed
    handlers = jira_alert_task.sh %(workflow)s %(event)s %(platform_name)s %(name)s %(point)s

Those send alerts to Jira so we are notified. Does Cylc not count this as handling failure? I.e., do I need to set the failure handling explicitly in the graph section by using ?

OK, this is why your workflow stalls pretty quickly on incomplete failed tasks. A runahead limit of P1 means only 2 cycles can be active at once. “Active” means there are tasks present in the active “n=0” window of the workflow. The default runahead limit is 5 cycles. Is there some reason why you’ve strapped this down to P1?

The comment on stall timeout is wrong - the workflow should not stall if there are tasks in the active window still waiting to retry (or in fact still waiting on clock or xtriggers).

Yes a stalled scheduler will stay alive, but it still can’t run anything until you intervene - which makes the last comment wrong. Note you can restart a stopped workflow at will, so there’s arguably no point in keeping a stalled scheduler alive indefinitely - unless perhaps you want to distinguish operational workflows from any number of stopped ones by their current running status.
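
For reference, here is a sketch of the same [[events]] settings with comments reworded to match the behaviour described above. The values are the ones from the posted workflow; only the comments are changed, and they are a paraphrase of this thread rather than official documentation wording:

```cylc
[scheduler]
    [[events]]
        # Tasks waiting to retry, or waiting on clock triggers or xtriggers,
        # do not count as a stall, so this does not need to exceed the
        # retry delays.
        stall timeout = PT3H
        # Keeps the scheduler alive after the stall timeout. A stalled
        # scheduler still cannot run anything until someone intervenes,
        # so the next cycle is NOT triggered by this setting.
        abort on stall timeout = False
```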

OK you’ve revealed a bit of a terminological ambiguity here! These event handlers do something (such as send an alert) in response to task events such as task started or succeeded or failed.

That’s different from “handling” a task failure itself in the graph in order to automatically remedy the effects of the failure so that the workflow can continue toward successful completion.

This example graph covers all of the “handling” cases:

# model_1 products are critical to the success of the workflow.
# If it fails something is wrong that needs fixing by humans.
# So the :succeeded output is required to complete the task.
model_1 => products_1 => publish_1

# model_2 products are also critical to the success of the workflow, but we
# know how to generate them even if it fails (e.g. run a short timestep model).
# So :succeeded and :failed are optional - the task is complete either way.
model_2? => products_2
model_2:failed? => model_2a => products_2a  
products_2 | products_2a => publish_2

# model_3 products are NOT critical, publish if we can, but if not that's OK
# so :succeeded is optional, but the :failed path is just do nothing
model_3? => products_3 => publish_3
  • model_1 must succeed, otherwise the failed task will be retained in the active window pending intervention. (e.g. fix a bug and re-trigger it to succeed, or remove it from the active window and take responsibility for the missing products.)

  • model_2 is completed by “succeeded or failed” - either way it has done its job within the workflow and can safely recede into the past.

  • model_3 is also complete either way, but its failure path is to do nothing, i.e. generate the products if possible but it’s OK if we can’t do that.

In Cylc, task output completion is more important than task success and failure, reflecting the fact that workflows can be designed to handle failures automatically. In that case (in the context of the workflow), those failures are much less important than other (unhandled) failures.

Hopefully that’s clear?

It’s not obvious from what you’ve said which of your tasks, if any, should be success-optional.

If the failing tasks are not critical, mark them as optional (like model_3 above) and the workflow will deem their failure to be unimportant - they won’t be retained pending intervention to fix the problem.
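
Applied to the graph posted earlier, that could look something like the fragment below. This is only a sketch: the choice of roms_bangladesh and the no-tide chain as non-critical is an assumption made purely for illustration (nothing in this thread says which tasks are actually non-critical), and the remaining graph lines and the [[xtriggers]] section are omitted.

```cylc
[scheduling]
    [[graph]]
        PT24H = """
            # ASSUMPTION (illustration only): roms_bangladesh is non-critical.
            # The "?" makes its success optional, so a failure is not
            # retained in the active window pending intervention.
            @cmems_hourly_done & @cmems_merged_done => roms_bangladesh?

            # ASSUMPTION (illustration only): the no-tide chain is non-critical.
            # roms_nz-no-tide still runs only if its spinup succeeds.
            @cmems_hourly_done & @cmems_merged_done => roms_nz-no-tide-spinup?
            roms_nz-no-tide-spinup? => roms_nz-no-tide?
        """
```

As noted above, an output marked optional has to be marked with "?" everywhere it appears in the graph, so any other line that mentions these tasks would need the same marker.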

If they are critical tasks, set the runahead limit higher if you don’t want them to stall the workflow so quickly (assuming that other parts of the workflow do not depend on their success!)

Hilary

Right! That’s what I meant by not having the right idea around Cylc expectations on handling failures. I was indeed considering my alerts as a form of error handling. So now everything is more clear.

Thanks for the thorough and prompt help again! I think I have a way forward that aligns with recommended practices. I’ll see what makes sense here in terms of what is success-optional and go from there.

And yes, that’s a small run-ahead limit, but I was keeping it short on purpose to study the stall behaviours more quickly. The intent is to make them higher going forward.

Thanks again!


Yeah I can see why you might think that. Maybe we need to document this stuff better.

It’s true that if you receive an alert (via event handling) on failure, you could go back and fix the problem and retrigger the task (allowing dependent tasks to run) even if it had receded into the past as complete.

However, that’s not safe, because you might miss the alert or forget to do anything about it - and then the workflow will happily carry on without executing it and whatever depends on it. In the worst case, your workflow might run to apparent successful completion very prematurely (or even without doing anything at all) as a result of such failures.

So that’s why we retain failed tasks in the active window and require that they are dealt with (unless their success is optional) in order for the workflow to complete successfully. (Or in the case of cycling workflows, in order for them to continue indefinitely on to future cycles).
