Hi,
We have a CI system that runs Cylc through a GitHub action in no-detach mode, i.e. with cylc play --no-detach experiment_name. When a task fails we get the following message:
2023-08-21T13:28:14Z INFO - [1/CloneJedi running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2023-08-21T13:28:14Z INFO - [1/CloneJedi running job:01 flows:1] => failed
2023-08-21T13:28:14Z WARNING - [1/CloneJedi failed job:01 flows:1] did not complete required outputs: ['succeeded']
2023-08-21T13:28:15Z ERROR - Incomplete tasks:
* 1/CloneJedi did not complete required outputs: ['succeeded']
2023-08-21T13:28:15Z CRITICAL - Workflow stalled
2023-08-21T13:28:15Z WARNING - PT1H stall timer starts NOW
2023-08-21T14:28:15Z WARNING - stall timer timed out after PT1H
2023-08-21T14:28:15Z ERROR - Workflow shutting down - "abort on stall timeout" is set
2023-08-21T14:28:18Z INFO - DONE
The system then hangs for an hour before sending the failure back to GitHub. Is that the expected behaviour with --no-detach? It seems a bit wasteful to wait an hour when the task has already failed, especially for jobs using multiple nodes. Is there an option we are missing? If not, is it safe to reduce this timeout from 1 hour to 1 minute, say?
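For concreteness, the change we have in mind would look something like the fragment below. This is only a sketch: we are assuming the PT1H timer in the log is controlled by the stall timeout setting under [scheduler][events] in flow.cylc, and PT1M is just an illustrative value, not something we have tested.

```cylc
# flow.cylc (sketch; assumes "stall timeout" is the setting behind the PT1H timer in the log)
[scheduler]
    [[events]]
        # Illustrative value: wait 1 minute after a stall instead of the 1 hour seen above
        stall timeout = PT1M
        # Already in effect according to the log message; shown here only for completeness
        abort on stall timeout = True
```

If that is the right setting, the workflow should shut down about a minute after the stall is detected, and the failure would reach GitHub that much sooner.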