Recommended way to have a task succeed after a few retries?

I was wondering if there was a recommended way to have a task succeed after a few retries? There are two I can think of, but I wanted to know if there was a preferred way.

  1. in the script section
    script = """
    if ((CYLC_TASK_TRY_NUMBER > 3)); then
        echo "WARNING: ...." >&2
        return 0
    fi
    normal_command
    """
    execution retry delays = PT10S, PT30S, PT1S
    
  2. err-script overriding CYLC_TASK_USER_SCRIPT_EXITCODE
    script = normal_command
    err-script = if ((CYLC_TASK_TRY_NUMBER > 2)); then CYLC_TASK_USER_SCRIPT_EXITCODE=0; fi
    execution retry delays = PT10S, PT30S
    

I’m guessing the preferred approach is the first one, as the second modifies more of Cylc's internal variables? What I don’t like about the first one is that if you have a standard script section defined at root, e.g. rose task-run with various options, then you have to repeat it (usually via Jinja2 or equivalent), whereas the second could be defined in a family that you inherit from (e.g. inherit = ..., AUTO_COMPLETE).

Are there other alternatives that would be cleaner than either of these?

p.s. feature request to have this definable in a config item to auto-succeed after N retries?

I’m not aware of a preferred way to do this, but that’s probably because I’ve never come across the need for it, except in dummy workflows for testing Cylc itself (in which I do it like your case 1.)

p.s. feature request to have this definable in a config item to auto-succeed after N retries?

I think we’d need a strong use case to consider doing that. If accidentally misused it could easily hide real task failures.

Are there other alternatives that would be cleaner than either of these?

Your example makes it look as if you want a task to fail several times before even running the real command, which will then presumably succeed. Is that right? If so, why?

Initially I thought you meant the real command fails, but after a few retries carry on as if it succeeded.
In that case I’d suggest not hiding the failure, since the task really did fail. Instead, use an event handler to “set the succeeded output” so that the scheduler can carry on as if the task had succeeded.

That’s not what it is doing. If the try number is larger than N, then it just exits. If it’s less, it runs the normal command.

I had not considered that approach. That sounds like it could be a more proper approach. Thanks.

Oops, my bad. For some reason I read it as CYLC_TASK_TRY_NUMBER < 3 (less than!).

At 8.2.4 you’ll have to use cylc set-outputs to trigger the tasks that depend on the succeeded outputs, and cylc remove to remove the incomplete failed task.

At 8.3.0 cylc set will complete the outputs of the target task as well as trigger the downstream tasks, so the remove bit becomes unnecessary.
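
For illustration, here is a rough, untested sketch of how the event-handler approach might look at 8.3.0 or later. The family name AUTO_COMPLETE is borrowed from the question and the task name foo is hypothetical. It assumes the failed event only fires once the retries are used up, and that the cylc command is available to the scheduler when it runs the handler.

```
[runtime]
    [[AUTO_COMPLETE]]
        execution retry delays = PT10S, PT30S
        [[[events]]]
            # Runs when the task finally fails (assumed: after the last retry).
            # The task is still recorded as failed; the handler only sets the
            # succeeded output so that downstream tasks can trigger.
            failed handlers = cylc set --out=succeeded "%(workflow)s//%(point)s/%(name)s"
    [[foo]]
        inherit = AUTO_COMPLETE
        script = normal_command
```

Because the handler lives in a family, the root script section does not need repeating, and the real failure still shows up in the job logs.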

Instead of making the task succeed after 3 tries, could you make succeed an optional output?

I.e. (foo? | foo:fail?) => bar

So when it fails after 3 tries, the workflow doesn’t stall and continues on.

Edit: actually, you can just do

foo:finish => bar

which is the same thing.
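
As a minimal sketch of that graph in context (the task names foo and bar are placeholders, and the retry delays are copied from the question), something like this should let bar run whether foo eventually succeeds or fails after its retries:

```
[scheduling]
    [[graph]]
        R1 = """
            # bar triggers once foo has finished, whether it succeeded or failed
            foo:finish => bar
        """

[runtime]
    [[foo]]
        script = normal_command
        execution retry delays = PT10S, PT30S
    [[bar]]
        script = true
```

With this, foo is still retried as usual and is recorded as failed if every try fails, but the workflow carries on rather than stalling.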