Hi,
I’m running a suite in Cylc 7.8.4 and seeing something that I believe is a bug, but I can’t replicate it reliably, so I assume it is a race condition. The suite has tasks with execution retry delays set, but they don’t reliably retry when they fail. The task itself does not have execution polling intervals set, but the global config does for the host the job is submitted to. My best guess is that the retry is lost because the task fails on walltime at the same moment as an execution poll happens, leading to some odd behaviour under the hood.
From cylc get-config for an example task:
[[defence_060_gl]]
    [[[job]]]
        execution retry delays = PT30S, PT30S
        submission retry delays = PT1M, PT1M, PT1M
        batch system = pbs
        execution time limit = PT10M
        shell = /bin/bash
        execution polling intervals =
        submission polling intervals =
        batch submit command template =
And the site config for the host that the job submits to:
use login shell = False
retrieve job logs = True
retrieve job logs retry delays = PT10S, PT30S, PT1M, PT3M
retrieve job logs max size = 40M
execution polling intervals = PT10M, PT10M, PT10M, PT20M, PT20M, PT20M, PT30M, PT30M, PT30M, PT30M, PT30M, PT30M, PT30M, PT30M, PT30M, PT30M
submission polling intervals = PT2M, PT5M, PT5M, PT10M, PT10M, PT20M, PT20M, PT30M, PT30M
From the log:
2021-02-03T06:57:47Z INFO - [defence_060_gl.20210203T0000Z] -submit-num=01, owner@host=host
2021-02-03T06:57:49Z INFO - [defence_060_gl.20210203T0000Z] status=ready: (internal)submitted at 2021-02-03T06:57:49Z for job(01)
2021-02-03T06:57:49Z INFO - [defence_060_gl.20210203T0000Z] -health check settings: submission timeout=None, polling intervals=PT2M,2*PT5M,2*PT10M,2*PT20M,2*PT30M,...
2021-02-03T06:57:54Z INFO - [defence_060_gl.20210203T0000Z] status=submitted: (received)started at 2021-02-03T06:57:53Z for job(01)
2021-02-03T06:57:54Z INFO - [defence_060_gl.20210203T0000Z] -health check settings: execution timeout=PT20M, polling intervals=PT10M,PT1M,PT2M,PT7M,...
2021-02-03T07:07:54Z INFO - [defence_060_gl.20210203T0000Z] -poll now, (next in PT1M (after 2021-02-03T07:08:54Z))
2021-02-03T07:07:55Z CRITICAL - [defence_060_gl.20210203T0000Z] status=running: (received)failed/TERM at 2021-02-03T07:07:54Z for job(01)
2021-02-03T07:07:55Z INFO - [defence_060_gl.20210203T0000Z] -job(01) failed, retrying in PT30S (after 2021-02-03T07:08:25Z)
2021-02-03T07:07:56Z INFO - [defence_060_gl.20210203T0000Z] status=retrying: (polled)failed at 2021-02-03T07:07:54Z for job(01)
2021-02-03T07:07:56Z INFO - [defence_060_gl.20210203T0000Z] status=retrying: (polled)failed/TERM at 2021-02-03T07:07:54Z for job(01)
2021-02-03T07:07:56Z CRITICAL - [defence_060_gl.20210203T0000Z] -job(01) failed
That looks to me like the task receives two failure messages: one from the job itself and one from the poll. In other words: the job is polled, the job fails and sends its own failure message, the task goes into the retrying state, and then the poll result comes back as failed. Because the task is no longer in the submitted or running state at that point, the next state it goes to is failed.
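To make the suspected sequence concrete, here is a minimal Python sketch of what I think is happening. It is a toy model only and is not taken from the Cylc source: the class `ToyTask`, the method `handle_failed` and the rule "retry only if the task is still submitted or running" are all my own assumptions about the behaviour.

```python
class ToyTask:
    """Toy model of the suspected behaviour. NOT Cylc's real code."""

    def __init__(self, retry_delays):
        self.state = "running"
        self.retry_delays = list(retry_delays)
        self.history = []

    def handle_failed(self, source):
        """Process a 'failed' message from the job ('received') or a poll ('polled')."""
        # Assumed rule: a retry is only scheduled while the task is still active.
        if self.state in ("submitted", "running") and self.retry_delays:
            delay = self.retry_delays.pop(0)
            self.state = "retrying"
            self.history.append(f"({source}) failed -> retrying in {delay}")
        else:
            # Any other state falls through to a hard failure,
            # even if retry delays are still available.
            self.state = "failed"
            self.history.append(f"({source}) failed -> failed")


task = ToyTask(["PT30S", "PT30S"])
task.handle_failed("received")  # failure message sent by the job itself
task.handle_failed("polled")    # poll result arriving a second later
for entry in task.history:
    print(entry)
print("final state:", task.state, "| unused retries:", task.retry_delays)
```

In this toy model the second message pushes the task from retrying to failed while one retry delay is still unused, which matches what the log shows.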
Am I correct? If so, are there any recommendations for avoiding this race condition, or is it fixed in later versions of Cylc?
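P.S. As a sanity check on the timing, here is a small Python sketch using the timestamps and settings quoted above. It assumes the walltime clock starts at the job's reported start time and the first execution poll is timed from when the started message was received. Both are guesses on my part, not something I have confirmed in the Cylc code.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(stamp):
    """Parse a log timestamp such as 2021-02-03T06:57:53Z."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


started = parse_utc("2021-02-03T06:57:53Z")   # "started at" time from the log
received = parse_utc("2021-02-03T06:57:54Z")  # when the started message was logged
time_limit = timedelta(minutes=10)            # execution time limit = PT10M
first_poll = timedelta(minutes=10)            # first execution polling interval = PT10M

walltime_expiry = started + time_limit        # assumed moment the job is killed
poll_due = received + first_poll              # assumed moment of the first poll
gap_seconds = abs((poll_due - walltime_expiry).total_seconds())

print("walltime expiry:", walltime_expiry.isoformat())
print("first poll due: ", poll_due.isoformat())
print("gap in seconds: ", gap_seconds)
```

Under those assumptions the walltime kill and the first poll land within about a second of each other, because the execution time limit and the first polling interval are both PT10M.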