Execution retry not working, task state showing as waiting

prajeeshag · December 14, 2022, 5:02am

Hi,

I have a task “check_IC_available” with “execution retry intervals” and “submission retry intervals” set to 20*PT5M. However, it does not resubmit after failing; instead, it goes to the waiting state.

The following is the scheduler log:

2022-12-14T02:28:12+05:30 INFO - [20221214T0000Z/check_IC_available submitted job:03 flows:1] poll now, (next in PT15M (after 2022-12-14T02:43:12+05:30))
2022-12-14T02:28:17+05:30 INFO - [20221214T0000Z/check_IC_available submitted job:03 flows:1] (polled)failed at 2022-12-14T02:14:16+05:30
2022-12-14T02:28:17+05:30 INFO - [20221214T0000Z/check_IC_available submitted job:03 flows:1] => waiting
2022-12-14T02:28:17+05:30 WARNING - [20221214T0000Z/check_IC_available waiting job:03 flows:1] retrying in PT5M (after 2022-12-14T02:33:17+05:30)
2022-12-14T02:33:18+05:30 INFO - xtrigger satisfied: _cylc_retry_20221214T0000Z/check_IC_available = wall_clock(trigger_time=1670965397.0793738)

hilary.j.oliver · December 14, 2022, 7:16am

If a task fails with retries configured, it is supposed to go back to the waiting state, to wait on the configured delay before the next try.
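To make the "configured delay" concrete: a value such as 20*PT5M is shorthand for twenty consecutive PT5M delays, one per retry. The following is a minimal Python sketch of that expansion. It is only an illustration, not Cylc's own parser, and the function name expand_delays is hypothetical.

```python
def expand_delays(spec):
    """Expand a comma-separated delay list with optional N* multipliers.

    Illustrative only: this is not how Cylc itself parses the setting.
    """
    delays = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if "*" in item:
            # "20*PT5M" means the PT5M interval repeated 20 times
            count, interval = item.split("*", 1)
            delays.extend([interval.strip()] * int(count))
        else:
            delays.append(item)
    return delays


delays = expand_delays("20*PT5M")
print(len(delays), delays[0])
```

Each failure consumes the next delay in the list, and the task sits in the waiting state for that long before its next try.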

retrying in PT5M (after 2022-12-14T02:33:17+05:30)

Are you saying that the task did not resubmit its job after the PT5M (5 minute) delay?

Your log excerpt doesn’t say one way or the other, as the job submission would come after the last line.
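As a cross-check on that last log line, the trigger_time value in the xtrigger message is a Unix timestamp. A minimal Python sketch (the function name is illustrative, and the +05:30 offset is taken from the timestamps in the log) converts it to the log's time zone, where it lands on the same instant as the "retrying in PT5M (after ...)" time:

```python
from datetime import datetime, timedelta, timezone

# UTC offset used by the scheduler log timestamps (+05:30)
LOG_TZ = timezone(timedelta(hours=5, minutes=30))


def trigger_time_to_iso(trigger_time, tz=LOG_TZ):
    """Convert a wall_clock trigger_time (Unix seconds) to an ISO 8601 string."""
    moment = datetime.fromtimestamp(trigger_time, tz=tz)
    # Drop the sub-second part to match the log's resolution
    return moment.replace(microsecond=0).isoformat()


print(trigger_time_to_iso(1670965397.0793738))
# 2022-12-14T02:33:17+05:30
```

So the retry xtrigger fired when it was expected to; the open question is only whether a job submission followed it.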

prajeeshag · December 14, 2022, 7:32am

Yes, the task was not resubmitted after the PT5M delay. There is nothing related to this task in the log after this line:
“2022-12-14T02:33:18+05:30 INFO - xtrigger satisfied: _cylc_retry_20221214T0000Z/check_IC_available = wall_clock(trigger_time=1670965397.0793738)”

hilary.j.oliver · December 14, 2022, 8:13am

Is this repeatable? And can you tell us what Cylc version you are running?

prajeeshag · December 14, 2022, 3:40pm

It is Cylc version 8.0.3. I tried a simplified version of the same workflow to recreate this issue, but that one resubmits the job correctly.

Maybe I should try reinstalling the workflow.

2022-12-14T20:32:05+05:30 INFO - [20221209T0000Z/check_IC_available submitted job:04 flows:1] health: submission timeout=None, polling intervals=PT15M,…
2022-12-14T20:47:05+05:30 INFO - [20221209T0000Z/check_IC_available submitted job:04 flows:1] poll now, (next in PT15M (after 2022-12-14T21:02:05+05:30))
2022-12-14T20:47:09+05:30 INFO - [20221209T0000Z/check_IC_available submitted job:04 flows:1] (polled)failed at 2022-12-14T20:32:53+05:30
2022-12-14T20:47:09+05:30 INFO - [20221209T0000Z/check_IC_available submitted job:04 flows:1] => waiting
2022-12-14T20:47:09+05:30 WARNING - [20221209T0000Z/check_IC_available waiting job:04 flows:1] retrying in PT5M (after 2022-12-14T20:52:09+05:30)
2022-12-14T20:52:09+05:30 INFO - xtrigger satisfied: _cylc_retry_20221209T0000Z/check_IC_available = wall_clock(trigger_time=1671031329.4756932)
2022-12-14T20:52:09+05:30 INFO - [20221209T0000Z/check_IC_available waiting job:04 flows:1] => waiting(queued)
2022-12-14T20:52:09+05:30 INFO - [20221209T0000Z/check_IC_available waiting(queued) job:04 flows:1] => waiting
2022-12-14T20:52:09+05:30 INFO - [20221209T0000Z/check_IC_available waiting job:05 flows:1] => preparing
2022-12-14T20:52:09+05:30 INFO - [20221209T0000Z/check_IC_available preparing job:05 flows:1] host=10.119.14.15
2022-12-14T20:52:13+05:30 INFO - [20221209T0000Z/check_IC_available preparing job:05 flows:1] (internal)submitted at 2022-12-14T20:52:47+05:30
2022-12-14T20:52:13+05:30 INFO - [20221209T0000Z/check_IC_available preparing job:05 flows:1] submitted to elogin04:background[16273]

oliver.sanders · December 14, 2022, 3:44pm

I can think of two things which are supposed to suppress retries:

If the task was “held” (marked with a paused badge next to the task icon in the UI).
If the workflow was “paused” (the word paused should appear in the toolbar in the UI).

oliver.sanders · December 14, 2022, 3:59pm

Reinstalling shouldn’t make a difference (this only syncs files from cylc-src to cylc-run). Restarting could make a difference because it would wipe any cached state in the scheduler (the state is preserved in the database).

If it can be reproduced, for diagnosis you could try running the workflow in --debug mode. Then send us the log file from the failure up to the first entry after the task should have retried. This might contain something informative.
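If it helps with trimming the log down to that window, here is a rough Python sketch. It assumes the log line format seen in the excerpts in this thread (a "failed at" message for the task, followed later by an "xtrigger satisfied: _cylc_retry_<task id>" line); the function name is hypothetical and this is not a Cylc utility.

```python
def slice_retry_window(log_lines, task_id):
    """Return lines from the task's failure up to the first entry after
    its retry xtrigger is satisfied.

    task_id is the cycle/name form used in the log,
    e.g. "20221214T0000Z/check_IC_available".
    """
    retry_marker = "xtrigger satisfied: _cylc_retry_" + task_id
    start = None
    for i, line in enumerate(log_lines):
        if start is None:
            # Look for the failure message of this task first
            if task_id in line and "failed at" in line:
                start = i
        elif retry_marker in line:
            # Include the xtrigger line plus one following entry, if any
            return log_lines[start:i + 2]
    # No retry xtrigger seen: return everything from the failure onwards
    return log_lines[start:] if start is not None else []
```

It could be called with the lines of the scheduler log file, e.g. slice_retry_window(open(path).read().splitlines(), "20221214T0000Z/check_IC_available"), where path points at the workflow's scheduler log.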

prajeeshag · December 14, 2022, 4:17pm

Okay. I will try that.
