Stall timer activating for failed tasks in no detach mode

Hi,

We have a CI system that runs Cylc through a GitHub Action in no-detach mode, i.e. with cylc play --no-detach experiment_name. When a task fails we get the following messages:

2023-08-21T13:28:14Z INFO - [1/CloneJedi running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2023-08-21T13:28:14Z INFO - [1/CloneJedi running job:01 flows:1] => failed
2023-08-21T13:28:14Z WARNING - [1/CloneJedi failed job:01 flows:1] did not complete required outputs: ['succeeded']
2023-08-21T13:28:15Z ERROR - Incomplete tasks:
      * 1/CloneJedi did not complete required outputs: ['succeeded']
2023-08-21T13:28:15Z CRITICAL - Workflow stalled
2023-08-21T13:28:15Z WARNING - PT1H stall timer starts NOW
2023-08-21T14:28:15Z WARNING - stall timer timed out after PT1H
2023-08-21T14:28:15Z ERROR - Workflow shutting down - "abort on stall timeout" is set
2023-08-21T14:28:18Z INFO - DONE

The system then hangs for an hour before sending the failure back to GitHub. Is that the expected behaviour with --no-detach? It seems a bit wasteful to wait an hour when the task has already failed, especially for jobs using multiple nodes. Is there an option we are missing? If not, is it safe to reduce this timeout from 1 hour to, say, 1 minute?

Hi @Dan_Holdaway

By default, a stalled scheduler stays up for an hour. (Stalled means there is nothing more to run but incomplete tasks are present, so something has gone wrong.) The delay allows some time for user intervention to correct the situation without having to restart the workflow.

You raise a good point - no-detach mode is often used in CI, in which case an immediate abort on stall is likely to be desirable. I’m not sure it’s a slam dunk though (in the sense that we may use no-detach mode for other reasons too).

In Cylc project functional testing, with no-detach mode, we typically put the following in the workflow config:

[scheduler]
    [[events]]
        abort on stall timeout = True
        stall timeout = PT0S  # (the default is PT1H)

And often also this:

        abort on inactivity timeout = True
        inactivity timeout = PT1M  # (be careful with this one)
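
Put together, the two fragments above would sit under the same [[events]] section. This is only a sketch for a CI-style workflow, using the same four settings and values shown above. The PT1M value is illustrative: as I understand it, the inactivity timer counts time without any workflow activity, so it should comfortably exceed the longest quiet period you expect (for example, a single long-running task), otherwise a healthy run could be aborted.

```
[scheduler]
    [[events]]
        # Shut down as soon as the workflow stalls,
        # instead of waiting for the default PT1H.
        abort on stall timeout = True
        stall timeout = PT0S

        # Optional: also shut down if there is no workflow
        # activity for this long. Pick a value longer than
        # your longest expected gap between task events.
        abort on inactivity timeout = True
        inactivity timeout = PT1M
```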

Thanks for the tips @hilary.j.oliver, this is very helpful. I can see there being other applications for no-detach mode, so that makes sense.

Thanks again,
Dan