Execution retry not working, task state showing as waiting

Hi,

I have a task “check_IC_available” with “execution retry delays” and “submission retry delays” set to 20*PT5M. However, after failing it does not resubmit; instead it goes to the waiting state.

The following is the scheduler log:

2022-12-14T02:28:12+05:30 INFO - [20221214T0000Z/check_IC_available submitted job:03 flows:1] poll now, (next in PT15M (after 2022-12-14T02:43:12+05:30))
2022-12-14T02:28:17+05:30 INFO - [20221214T0000Z/check_IC_available submitted job:03 flows:1] (polled)failed at 2022-12-14T02:14:16+05:30
2022-12-14T02:28:17+05:30 INFO - [20221214T0000Z/check_IC_available submitted job:03 flows:1] => waiting
2022-12-14T02:28:17+05:30 WARNING - [20221214T0000Z/check_IC_available waiting job:03 flows:1] retrying in PT5M (after 2022-12-14T02:33:17+05:30)
2022-12-14T02:33:18+05:30 INFO - xtrigger satisfied: _cylc_retry_20221214T0000Z/check_IC_available = wall_clock(trigger_time=1670965397.0793738)

If a task fails with retries configured, it is supposed to go back to the waiting state, to wait on the configured delay before the next try.

retrying in PT5M (after 2022-12-14T02:33:17+05:30)

Are you saying that the task did not resubmit its job after the PT5M (5 minute) delay?

Your log excerpt doesn’t say one way or the other, as the job submission would come after the last line.
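A minimal sketch for checking this on a longer log: it looks for retry xtrigger lines for one task that are never followed by a "=> preparing" line for that task. The line patterns are assumptions taken from the excerpts in this thread and may differ in other Cylc versions; the function name retries_without_submission is hypothetical.

```python
import re

# Pattern assumed from the log excerpts in this thread (Cylc 8.0.3);
# other versions may word these lines differently.
RETRY_RE = re.compile(r"xtrigger satisfied: _cylc_retry_(?P<task>\S+) =")


def retries_without_submission(log_lines, task_id):
    """Return retry-xtrigger lines for task_id that are not followed
    by a '=> preparing' line for the same task.

    task_id is the cycle/name form used in the log,
    e.g. '20221214T0000Z/check_IC_available'.
    """
    stuck = []
    pending = None  # last retry line still waiting for a submission
    for line in log_lines:
        match = RETRY_RE.search(line)
        if match and match.group("task") == task_id:
            if pending is not None:
                # A second retry fired before any submission was seen.
                stuck.append(pending)
            pending = line.rstrip("\n")
        elif (
            pending is not None
            and f"[{task_id} " in line
            and "=> preparing" in line
        ):
            # The task moved on to job preparation, so the retry worked.
            pending = None
    if pending is not None:
        stuck.append(pending)
    return stuck
```

Passing it the lines of the scheduler log (for example via open(path) on the log file) together with the task ID returns an empty list if every retry was followed by a new job.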

Yes, the task was not resubmitted after the PT5M delay. There is nothing related to this task in the log after this line:
“2022-12-14T02:33:18+05:30 INFO - xtrigger satisfied: _cylc_retry_20221214T0000Z/check_IC_available = wall_clock(trigger_time=1670965397.0793738)”
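As a sanity check, the trigger_time in that line is a Unix epoch timestamp, and it can be converted back to local time to confirm the retry xtrigger fired at the expected moment. This is a small sketch; the function name and the default +05:30 offset (taken from the log timestamps) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def trigger_time_to_local(trigger_time, utc_offset_minutes=330):
    """Convert a wall_clock trigger_time (epoch seconds) to an ISO 8601
    string in a fixed UTC offset (default +05:30, as in the log above)."""
    tz = timezone(timedelta(minutes=utc_offset_minutes))
    return datetime.fromtimestamp(trigger_time, tz).isoformat(timespec="seconds")


print(trigger_time_to_local(1670965397.0793738))
# 2022-12-14T02:33:17+05:30
```

That matches the "retrying in PT5M (after 2022-12-14T02:33:17+05:30)" message, so the retry delay itself expired on schedule; the open question is why no job followed.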

Is this repeatable? And can you tell us what Cylc version you are running?

It is Cylc version 8.0.3. I tried a simplified version of the same workflow to recreate the issue, but that one resubmits the job correctly.

Maybe I should try reinstalling the workflow.

2022-12-14T20:32:05+05:30 INFO - [20221209T0000Z/check_IC_available submitted job:04 flows:1] health: submission timeout=None, polling intervals=PT15M,…
2022-12-14T20:47:05+05:30 INFO - [20221209T0000Z/check_IC_available submitted job:04 flows:1] poll now, (next in PT15M (after 2022-12-14T21:02:05+05:30))
2022-12-14T20:47:09+05:30 INFO - [20221209T0000Z/check_IC_available submitted job:04 flows:1] (polled)failed at 2022-12-14T20:32:53+05:30
2022-12-14T20:47:09+05:30 INFO - [20221209T0000Z/check_IC_available submitted job:04 flows:1] => waiting
2022-12-14T20:47:09+05:30 WARNING - [20221209T0000Z/check_IC_available waiting job:04 flows:1] retrying in PT5M (after 2022-12-14T20:52:09+05:30)
2022-12-14T20:52:09+05:30 INFO - xtrigger satisfied: _cylc_retry_20221209T0000Z/check_IC_available = wall_clock(trigger_time=1671031329.4756932)
2022-12-14T20:52:09+05:30 INFO - [20221209T0000Z/check_IC_available waiting job:04 flows:1] => waiting(queued)
2022-12-14T20:52:09+05:30 INFO - [20221209T0000Z/check_IC_available waiting(queued) job:04 flows:1] => waiting
2022-12-14T20:52:09+05:30 INFO - [20221209T0000Z/check_IC_available waiting job:05 flows:1] => preparing
2022-12-14T20:52:09+05:30 INFO - [20221209T0000Z/check_IC_available preparing job:05 flows:1] host=10.119.14.15
2022-12-14T20:52:13+05:30 INFO - [20221209T0000Z/check_IC_available preparing job:05 flows:1] (internal)submitted at 2022-12-14T20:52:47+05:30
2022-12-14T20:52:13+05:30 INFO - [20221209T0000Z/check_IC_available preparing job:05 flows:1] submitted to elogin04:background[16273]

I can think of two things that are supposed to suppress retries:

  • If the task was “held” (marked with a paused badge next to the task icon in the UI).
  • If the workflow was “paused” (the word paused should appear in the toolbar in the UI).
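If the UI is not handy, both conditions could in principle be checked from the workflow's SQLite database. The sketch below is only illustrative: the table and column names (workflow_params with an is_paused key, task_pool with an is_held column) are assumptions about the Cylc 8 database schema and should be verified against your installation, and the function name is hypothetical.

```python
import sqlite3


def held_or_paused(conn, cycle, name):
    """Report whether the workflow is paused and whether a task is held.

    ASSUMED schema (verify against your Cylc version):
      workflow_params(key, value)      with a row key='is_paused'
      task_pool(cycle, name, ..., is_held)
    """
    paused_row = conn.execute(
        "SELECT value FROM workflow_params WHERE key = 'is_paused'"
    ).fetchone()
    held_rows = conn.execute(
        "SELECT is_held FROM task_pool WHERE cycle = ? AND name = ?",
        (cycle, name),
    ).fetchall()
    # Compare as strings so both integer 1 and text '1' count as true.
    paused = paused_row is not None and str(paused_row[0]) == "1"
    held = any(str(row[0]) == "1" for row in held_rows)
    return {"workflow_paused": paused, "task_held": held}
```

It takes an open sqlite3 connection, so it is safest to point it at a copy of the database rather than the live file of a running workflow.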

Reinstalling shouldn’t make a difference (it only syncs files from cylc-src to cylc-run). Restarting could, because it would wipe any state cached in the scheduler (the workflow state itself is preserved in the database).

If it can be reproduced, for diagnosis you could try running the workflow in --debug mode. Then send us the log file from the failure up to the first entry after the task should have retried. This might contain something informative.

Okay. I will try that.