Recommended way to have a task succeed after a few retries?

TomC · March 4, 2024, 2:47am

I was wondering if there was a recommended way to have a task succeed after a few retries? There are two I can think of, but I wanted to know if there was a preferred way.

In the script section:

script = """
if ((CYLC_TASK_TRY_NUMBER > 3)); then
    echo "WARNING: ...." >&2
    return 0
fi
normal_command
"""
execution retry delays = PT10S, PT30S, PT1S

With err-script overriding CYLC_TASK_USER_SCRIPT_EXITCODE:

script = normal_command
err-script = if ((CYLC_TASK_TRY_NUMBER > 2)); then CYLC_TASK_USER_SCRIPT_EXITCODE=0; fi
execution retry delays = PT10S, PT30S

I’m guessing the preferred approach is the first one, as the second modifies more internal Cylc variables? What I don’t like about the first one is that if you have a standard script section defined at root (e.g. rose task-run with various options), then you have to repeat it (usually via Jinja2 or equivalent), whereas the second could be defined in a family that you inherit from (e.g. inherit = ..., AUTO_COMPLETE).
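
For illustration, the family-based variant of the second option might be laid out as below. This is only a sketch of the idea as described in the post: AUTO_COMPLETE is the family name suggested above, foo is a hypothetical task, and the thread does not confirm that overriding CYLC_TASK_USER_SCRIPT_EXITCODE in err-script actually changes the reported task outcome.

```cylc
[runtime]
    [[root]]
        # Standard script shared by all tasks (options omitted here).
        script = rose task-run

    [[AUTO_COMPLETE]]
        # Assumption: on the final try, override the exit code so the
        # job is treated as succeeded. Not verified in this thread.
        err-script = """
            if ((CYLC_TASK_TRY_NUMBER > 2)); then
                CYLC_TASK_USER_SCRIPT_EXITCODE=0
            fi
        """
        execution retry delays = PT10S, PT30S

    [[foo]]
        # Hypothetical task that opts in by inheriting the family.
        inherit = AUTO_COMPLETE
```

The appeal of this layout is that the root script stays untouched and only tasks inheriting the family get the auto-complete behaviour.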

Are there other alternatives that would be cleaner than either of these?

p.s. feature request to have this definable in a config item to auto-succeed after N retries?

hilary.j.oliver · March 4, 2024, 4:04am

I’m not aware of a preferred way to do this, but that’s probably because I’ve never come across the need for it, except in dummy workflows for testing Cylc itself (in which I do it like your first case).

> p.s. feature request to have this definable in a config item to auto-succeed after N retries?

I think we’d need a strong use case to consider doing that. If accidentally misused it could easily hide real task failures.

> Are there other alternatives that would be cleaner than either of these?

Your example makes it look as if you want a task to fail several times before even running the real command, which will then presumably succeed. Is that right? If so, why?

Initially I thought you meant that the real command fails, but after a few retries you want to carry on as if it had succeeded.
In that case I’d suggest not hiding the failure (it really did fail). Instead, use an event handler to “set the succeeded output” so that the scheduler can carry on as if the task succeeded.

TomC · March 4, 2024, 4:39am

That’s not what it is doing. If the try number is larger than N, then just exit. If it’s N or less, run the normal command.

I had not considered that approach. That sounds like it could be a more proper approach. Thanks.

hilary.j.oliver · March 4, 2024, 4:55am

Oops, my bad. For some reason I read it as CYLC_TASK_TRY_NUMBER < 3 (less than!).

At 8.2.4 you’ll have to use cylc set-outputs to trigger the tasks that depend on the succeeded outputs, and cylc remove to remove the incomplete failed task.

At 8.3.0 cylc set will complete the outputs of the target task as well as trigger the downstream tasks, so the remove bit becomes unnecessary.
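
A minimal sketch of wiring this into an event handler, assuming the failed handlers setting and the %(workflow)s, %(point)s and %(name)s handler template variables. The task name foo is hypothetical, and the exact command options are assumptions that should be checked against cylc set-outputs --help or cylc set --help for the installed version. The failed event should only fire once the execution retries are exhausted, so the handler runs after the last try.

```cylc
[runtime]
    [[foo]]
        script = normal_command
        execution retry delays = PT10S, PT30S
        [[[events]]]
            # Cylc 8.2.x (assumed form): spawn downstream tasks off the
            # succeeded output, then remove the incomplete failed task.
            failed handlers = cylc set-outputs %(workflow)s//%(point)s/%(name)s, cylc remove %(workflow)s//%(point)s/%(name)s

            # Cylc 8.3.0 or later (assumed form): a single command
            # completes the output and triggers downstream tasks.
            # failed handlers = cylc set --out=succeeded %(workflow)s//%(point)s/%(name)s
```

The job still reports a genuine failure in its logs; only the scheduler is told to proceed. If the order of the two 8.2.x commands matters, a small wrapper script that runs them in sequence may be safer than relying on the order of the handler list.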

MetRonnie · March 4, 2024, 3:01pm

Instead of making the task succeed after 3 tries, could you make succeed an optional output?

I.e. (foo? | foo:fail?) => bar

So when it fails after 3 tries, the workflow doesn’t stall and continues on.

Edit: actually, you can just do

foo:finish => bar

which is the same thing
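
As a rough sketch, the optional-output approach could sit in a workflow like the one below. The task names foo and bar come from the graph lines above; the script for bar is a placeholder, and the retry delays are copied from the original question.

```cylc
[scheduling]
    [[graph]]
        R1 = """
            # bar runs once foo has finished, whether it succeeded
            # or failed after exhausting its retries.
            foo:finish => bar
        """

[runtime]
    [[foo]]
        script = normal_command
        execution retry delays = PT10S, PT30S
    [[bar]]
        # Placeholder command for illustration only.
        script = true
```

With this graph, no exit codes are overridden: foo is recorded as failed when it fails, and the workflow carries on without stalling.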
