Execution retry delays and execution job polling


I’m running a suite in Cylc 7.8.4 and seeing what I believe is a bug, but I can’t reliably reproduce it, so I assume it is a race condition. The suite has tasks with execution retry delays set, but they don’t reliably retry when they fail. The task itself does not set execution polling intervals, but the global config does for the host the job is submitted to. My best guess is that the retry is skipped because the task fails on walltime at the same moment an execution poll happens, leading to some odd behaviour under the hood.

From cylc get-config for an example task:

            execution retry delays = PT30S, PT30S
            submission retry delays = PT1M, PT1M, PT1M
            batch system = pbs
            execution time limit = PT10M
            shell = /bin/bash
            execution polling intervals =
            submission polling intervals =
            batch submit command template =

and the site-config for the host that job submits to:

        use login shell = False
        retrieve job logs = True
        retrieve job logs retry delays = PT10S, PT30S, PT1M, PT3M
        retrieve job logs max size = 40M
        execution polling intervals = PT10M, PT10M, PT10M, PT20M, PT20M, PT20M, PT30M, PT30M, PT30M, PT30M, PT30M, PT30M, PT30M, PT30M, PT30M, PT30M
        submission polling intervals = PT2M, PT5M, PT5M, PT10M, PT10M, PT20M, PT20M, PT30M, PT30M
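
One thing that stands out when comparing the two configs: the first execution polling interval (PT10M) is the same length as the task’s execution time limit (PT10M). Below is a minimal sketch for checking that, assuming polls are scheduled at cumulative offsets from job start. `iso_duration_seconds` is a hypothetical helper written for this post, not part of Cylc, and it only handles simple PTnHnMnS durations. Cylc may also adjust the configured sequence internally, so treat the result as a rough indication only.

```python
import re


def iso_duration_seconds(text):
    """Convert a simple ISO 8601 duration such as PT10M or PT1M30S to seconds.

    Hypothetical helper for illustration; not a Cylc function.
    """
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the task config and the site config above.
execution_time_limit = "PT10M"
execution_polling_intervals = "PT10M, PT10M, PT10M, PT20M"  # first four only

limit = iso_duration_seconds(execution_time_limit)

# Assumption: each poll fires after the sum of the intervals so far.
elapsed = 0
poll_times = []
for item in execution_polling_intervals.split(","):
    elapsed += iso_duration_seconds(item)
    poll_times.append(elapsed)

print(poll_times)              # [600, 1200, 1800, 3000]
print(poll_times[0] == limit)  # True
```

If the first poll and the walltime kill really do land at the same offset, the poll result and the job’s own failure message would be expected to arrive within a second or so of each other.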

From the log:

2021-02-03T06:57:47Z INFO - [defence_060_gl.20210203T0000Z] -submit-num=01, owner@host=host
2021-02-03T06:57:49Z INFO - [defence_060_gl.20210203T0000Z] status=ready: (internal)submitted at 2021-02-03T06:57:49Z for job(01)
2021-02-03T06:57:49Z INFO - [defence_060_gl.20210203T0000Z] -health check settings: submission timeout=None, polling intervals=PT2M,2*PT5M,2*PT10M,2*PT20M,2*PT30M,...
2021-02-03T06:57:54Z INFO - [defence_060_gl.20210203T0000Z] status=submitted: (received)started at 2021-02-03T06:57:53Z for job(01)
2021-02-03T06:57:54Z INFO - [defence_060_gl.20210203T0000Z] -health check settings: execution timeout=PT20M, polling intervals=PT10M,PT1M,PT2M,PT7M,...
2021-02-03T07:07:54Z INFO - [defence_060_gl.20210203T0000Z] -poll now, (next in PT1M (after 2021-02-03T07:08:54Z))
2021-02-03T07:07:55Z CRITICAL - [defence_060_gl.20210203T0000Z] status=running: (received)failed/TERM at 2021-02-03T07:07:54Z for job(01)
2021-02-03T07:07:55Z INFO - [defence_060_gl.20210203T0000Z] -job(01) failed, retrying in PT30S (after 2021-02-03T07:08:25Z)
2021-02-03T07:07:56Z INFO - [defence_060_gl.20210203T0000Z] status=retrying: (polled)failed at 2021-02-03T07:07:54Z for job(01)
2021-02-03T07:07:56Z INFO - [defence_060_gl.20210203T0000Z] status=retrying: (polled)failed/TERM at 2021-02-03T07:07:54Z for job(01)
2021-02-03T07:07:56Z CRITICAL - [defence_060_gl.20210203T0000Z] -job(01) failed

To me, that looks like the suite receives two failure messages, one from the task and one from the poll. In other words: the job is polled, the job fails (task failure message) and the task goes into the retrying state, then the poll result comes back (failed). Because the state is no longer submitted or running at that point, the next state it goes to is failed.
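
To make the suspected sequence concrete, here is a toy model of it. This is not Cylc’s scheduler code: `handle_failure` and `replay` are hypothetical names, and the rule "a failure only triggers a retry while the task is submitted or running" is my guess at the behaviour, based purely on the log above.

```python
# Toy model only: NOT Cylc's real scheduler code.
# State names follow the log excerpt above.

ACTIVE_STATES = ("submitted", "running")


def handle_failure(state, retries_left):
    """Return (new_state, retries_left) after a 'failed' message arrives.

    Assumed rule: a failure triggers a retry only if the task is still
    active and has retries left; otherwise it is treated as final.
    """
    if state in ACTIVE_STATES and retries_left > 0:
        return "retrying", retries_left - 1
    return "failed", retries_left


def replay(messages, retries_left=2):
    """Apply a sequence of failure messages to a running task."""
    state = "running"
    history = [state]
    for _source in messages:  # the source ("received"/"polled") is ignored
        state, retries_left = handle_failure(state, retries_left)
        history.append(state)
    return history


# Normal case: only the job's own failure message arrives.
print(replay(["received"]))            # ['running', 'retrying']

# Suspected race: the poll result reports the same failure again.
print(replay(["received", "polled"]))  # ['running', 'retrying', 'failed']
```

In this model the second, duplicate failure report for job(01) is what knocks the task out of retrying, which matches the order of the last four log lines.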

Am I correct? If so, are there any recommendations for avoiding this race condition, or is it fixed in later versions of Cylc?

I think you’ve hit this bug: cylc/cylc-flow issue #3460, “Failure trigger activated incorrectly for a retrying task due to polling”.
It doesn’t appear to affect Cylc 8, but I’m afraid we don’t have a fix for Cylc 7 at the moment.