Stall timer activating for failed tasks in no detach mode

Dan_Holdaway · August 21, 2023, 3:44pm

Hi,

We have a CI system that runs Cylc through a GitHub action in no detach mode i.e. with cylc play --no-detach experiment_name. When a task fails we get the following message:

2023-08-21T13:28:14Z INFO - [1/CloneJedi running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2023-08-21T13:28:14Z INFO - [1/CloneJedi running job:01 flows:1] => failed
2023-08-21T13:28:14Z WARNING - [1/CloneJedi failed job:01 flows:1] did not complete required outputs: ['succeeded']
2023-08-21T13:28:15Z ERROR - Incomplete tasks:
      * 1/CloneJedi did not complete required outputs: ['succeeded']
2023-08-21T13:28:15Z CRITICAL - Workflow stalled
2023-08-21T13:28:15Z WARNING - PT1H stall timer starts NOW
2023-08-21T14:28:15Z WARNING - stall timer timed out after PT1H
2023-08-21T14:28:15Z ERROR - Workflow shutting down - "abort on stall timeout" is set
2023-08-21T14:28:18Z INFO - DONE

And then the system hangs for an hour before sending the failure back to GitHub. Is that the expected behaviour with --no-detach? It seems a bit wasteful to wait an hour when the task has already failed, especially for jobs using multiple nodes. Is there an option we are missing? Or if not is it safe to reduce this timeout from 1hour to 1minute, say?

hilary.j.oliver · August 21, 2023, 10:17pm

Hi @Dan_Holdaway

For the scheduler to stay up for an hour if stalled (which means, there is nothing more to run but there are incomplete tasks present, so something has gone wrong) is the default, to allow some time for user intervention to correct the situation without having to restart the workflow.

You raise a good point - no-detach mode is often used in CI, in which case an immediate abort on stall is likely to be desirable. I’m not sure it’s a slam dunk though (in the sense that we may use no-detach mode for other reasons too).

In Cylc project functional testing, with no-detach mode, we typically put the following in workflow config:

[scheduler]
    [[events]]
        abort on stall timeout = True
        stall timeout = PT0S  # (the default is PT1H)

And often also this:

        abort on inactivity timeout = True
        inactivity timeout = PT1M  # (be careful with this one)

Dan_Holdaway · August 22, 2023, 2:32am

Thanks for the tips @hilary.j.oliver , this is very helpful. I could see there being other applications for no detach so that makes sense.

Thanks again,
Dan

Topic		Replies	Views
How to handle workflow stalling Cylc Support	2	203	July 20, 2023
Showing workflow state after run has finished Cylc Support	3	206	July 3, 2023
What setting keeps cylc workflows active? Cylc Support	2	169	October 24, 2023
Running Cylc in CI Cylc Support	2	306	July 26, 2022
Repetitive timeouts in the UI server Cylc Support	8	29	August 5, 2024

Stall timer activating for failed tasks in no detach mode

Related topics