Blocking a task being triggered if its recovery task is active

Hi,

There have been multiple instances of trigger-happy support staff seeing a failed task and just forcing it to run without thought. In many cases that is not a problem. But if an alternate/recovery task is running, you can get into a lot of trouble. For example, if the model is running with a shorter timestep and the normal-timestep task then starts to run at the same time, then (ignoring the waste of resources) the normal-timestep run will overwrite data the short-timestep run has produced, leading to data corruption or oddities in the data.

A simple example graph; mentally add ? or ! as you see fit so that only one of the two tasks needs to succeed.

um:failed => um_ss
um | um_ss => next_task

In Cylc 7, we came up with this snippet. If the alternate task is running, submitted, queued, or ready, the normal task aborts immediately. Also, to discourage this behaviour, we change the normal task’s state from failed to expired to indicate that it should not be rerun.

{% set FCST_PRE_SCRIPT_TO_LIMIT_TO_ONE = '
# Find out if the alternative UM task has been at least submitted
local task_status
task_status=$(cylc dump "${CYLC_SUITE_NAME}" --tasks \
    | grep "^${ALTERNATIVE_TASK_NAME}, ${CYLC_TASK_CYCLE_POINT}" \
    | cut -d"," -f3 \
    | xargs)
case ${task_status} in
  ready | submitted | running | queued)
    echo "USER ERROR: Automatic recovery is in progress.
This task should not be executed whilst $ALTERNATIVE_TASK_NAME is already running.
Please view the documentation for this task." | tee >(cat >&2)
    return 1
    ;;
esac
' %}
{% set ALT_FCST_PRE_SCRIPT_TO_LIMIT_TO_ONE = FCST_PRE_SCRIPT_TO_LIMIT_TO_ONE ~ '

# The alternate task is running, so we want to hold the normal one so people do not accidentally run it
cylc reset -s expired "${CYLC_SUITE_NAME}" "${CYLC_TASK_NAME/_REPLACEME_FCST_ALT/}.${CYLC_TASK_CYCLE_POINT}"
' %}

I guess I have two questions:

  1. Is it possible natively in Cylc to block task A from running if task B is running? Not just limit it by queues, because then A would run after B finishes, but completely block it (preferably with an error message).
  2. How would the above Cylc 7 code be translated to Cylc 8? With fewer task states (and no way to simply reset a state as in Cylc 7), I can’t picture how to stop a task from looking like it has failed, or how to tell “really failed” apart from “failed, but please do not touch it because we are recovering automatically”.

Thanks for any assistance on this one.

No. It’s an interesting idea that could be worth considering, but my first thought is that it goes against the primacy of the workflow owner (and potentially, authorized users) in overriding the scheduler if they deem it necessary.

You can still do the first bit very easily in current Cylc 8 releases: have the task scripting fail immediately if it detects that the recovery task is already running or has already run.
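A minimal sketch of such a guard, in shell to match the Cylc 7 snippet above. It assumes `cylc dump --tasks` prints one line per task beginning `name, cycle point, state`, which is what the Cylc 7 snippet relies on; check the output of your Cylc 8 release before using it. The function name `recovery_is_active` is hypothetical, and the list of "active" states (`preparing`, `submitted`, `running`) is an assumption; add `succeeded` if "has already run" should also block.

```shell
# Hypothetical helper: reads task-dump lines on stdin and exits 0 if the
# given task at the given cycle point is in an active state, else exits 1.
# ASSUMPTION: each input line starts with "name, cycle_point, state".
recovery_is_active() {
    # $1 = recovery task name, $2 = cycle point
    awk -F',' -v task="$1" -v point="$2" '
        {
            # Trim whitespace around the first three comma-separated fields
            for (i = 1; i <= 3; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
        }
        $1 == task && $2 == point && $3 ~ /^(preparing|submitted|running)$/ {
            found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Possible use near the top of the main task script (sketch only):
#   if cylc dump --tasks "${CYLC_WORKFLOW_ID}" \
#       | recovery_is_active um_ss "${CYLC_TASK_CYCLE_POINT}"; then
#       echo "USER ERROR: automatic recovery is in progress." >&2
#       exit 1
#   fi
```

Keeping the state check in a function that reads stdin means it can be tried against pasted dump output without a running workflow.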

The reset-to-expired bit is currently not supported, but is coming very soon as part of our rationalization of intervention capabilities in Cylc 8.

However:

An expired task can be manually triggered like any other, so that doesn’t seem to me like a very effective way of telling the operator not to trigger the task.

Maybe have a job submission event handler on the alternate task send some kind of alert (in fact that could also be used to expire the main task if you really think that helps).
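A rough sketch of what such a handler could run, in shell. It assumes the handler is attached to the alternate task's submitted event and is passed the workflow ID and task ID as its two arguments (for example via the `%(workflow)s` and `%(id)s` handler template variables). The function name `alert_recovery_started` and the message wording are hypothetical, and the output would need piping into whatever alerting mechanism your site uses.

```shell
# Hypothetical event-handler body: builds a one-line alert telling operators
# that automatic recovery has started and the failed task should be left alone.
alert_recovery_started() {
    # $1 = workflow ID, $2 = ID of the recovery task that was just submitted
    workflow="$1"
    task_id="$2"
    printf 'RECOVERY IN PROGRESS: %s submitted in %s. Do not re-trigger the failed main task.\n' \
        "$task_id" "$workflow"
}

# Example (sketch): alert_recovery_started "my_workflow" "20240101T0000Z/um_ss"
```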

Note also that in Cylc 8 the alternate succeed/fail paths have to be marked “optional”. This means that the failed main task, while still “failed”, is not retained by the scheduler as an incomplete task, which allows us to de-emphasize the failure (compared to failures that are not handled by the graph).

So, the big advantage we have with Cylc 8 is that support staff no longer need to worry about failed tasks - they only need to worry about incomplete ones. I think we still have work to do to ensure incomplete tasks are highlighted in the UI.

Yes, that is why we have the abort logic in there. The expiry is simply to hide the fail state so operators don’t even try. Some see red and think they need to do something.

It would not need to override said primacy. It could be a prompt: “It is recommended that A is not run while B is running. Are you sure you want to trigger A?” This warns them that their action goes against the advice of the workflow designer, but gives them the option to continue. It is an intentional pause and a prompt for their thought process.

I’m not experienced enough with Cylc 8 to really understand this, or how visible it is to a user monitoring 30 workflows at once, as well as every other operational non-HPC/Cylc system, observation station, etc.

At the moment it’s not really visible - this is something we’ll need to work on to take full advantage of this new feature of Cylc 8.