Preventing alternate and main task from running and inconsistent task status

AsifulI · September 17, 2024, 4:58am

Hi Cylc team

My question is directly related to and an extension of the discussion covered in this post.

I’m trying to do the following -

Try to run main forecast (with 3 retries) (optional success and completion)
If it fails trigger short timestep forecast (compulsory success and completion)
In pre-script check the status of main forecast, and set to expired
If all goes well, the workflow moves on. However, I need a way to possibly remove the main forecast from the graph entirely.

I’ve basically got the first 3 steps working OK, based on past work in our team, and some updated commands in Cylc 8. However, there appear to be some challenges and confusing outputs with some of the Cylc commands -

I’ve found cylc workflow-state <workflow> to be the most consistent and cleanest to parse. However, it only seems to work on the scheduler VM and not on the HPC.
cylc dump <workflow> --tasks was fairly easy to parse, but appears to only show things in the core active window (n=0). The help guide suggests it’s for n=0 window
cylc show <workflow> was also easy to parse, and this is what I’m using at the moment. The help guide suggests it’s for n-window, so is there a way to choose the n value? Is it basically 1?

I am currently using cylc show to get the status of the main and alternate forecast tasks, similar to the snippet below. The main forecast task holds the task_status as a local variable and checks for success of the alternate task when someone tries to run it (despite it possibly being in the expired state, so I’m trying to guard against operator error).

{% set FCST_PRE_SCRIPT_TO_LIMIT_TO_ONE = '
# Find out if the alternative UM task has been at least submitted
local task_status
task_status=$(cylc show "${CYLC_SUITE_NAME}//${CYLC_TASK_CYCLE_POINT}/${ALTERNATIVE_TASK_NAME}" \
    | grep "state" \
    | cut -d":" -f2 \
    | xargs)
case ${task_status} in
...
  succeeded)
    echo "WARNING: Alternative task has succeeded. Do not re-run. Removing from graph and aborting." | tee >(cat >&2)
    # cylc remove "${CYLC_SUITE_NAME}//${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME}"
    # Initial attempts at this caused the task to stay on `running` state in the graph, but the logs suggest it succeeded.
    return 1
    ;;
esac
' %}

So, the attempt at removing the main task in its own pre-script didn’t work (commented out above) as it sits indefinitely on the Web UI graph as running status but logs are generated. At the moment I’m preventing running via the error messages and return 1 on main forecast task if the alternate is queued or running, or even succeeded. Here is a snippet of how I’m setting the main forecast task to expire in the alternate task’s pre-script.

{% set ALT_FCST_PRE_SCRIPT_TO_LIMIT_TO_ONE = FCST_PRE_SCRIPT_TO_LIMIT_TO_ONE ~ '
# The alternate task is running, so we want to expire the normal one so people do not accidentally run it
cylc set -o expired "${CYLC_SUITE_NAME}//${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME/_REPLACEME_FCST_ALT/}"
' %}

Can you please clarify the use of cylc dump and cylc show, and which you’d recommend in such cases? I have found that when I have things like submit-failed on a current cycle, and a waiting task on the next cycle, it shows up as the same with something like

## From cylc dump
nq_um_fcst_18, 20240915T1800Z, waiting, not-held, not-queued, not-runahead
## From cylc show 
state: waiting
output completion: incomplete

I suppose it’s possible to use a combination of the state and the output completion variable values, but if I query the nq_um_fcst_19 (which spawned but cycle isn’t running) it shows the same waiting and incomplete status - so how does one differentiate? The nq_um_fcst_18 is currently on purple in the web UI suggesting its submission failed and I don’t see a way of picking this up unfortunately.

Finally, what would you recommend as the best approach to remove the main forecast task from the graph if I’m trying to guard against operators running failed or expired state tasks thinking it needs to run? I am thinking it might be possible to use an exit-script but I have no experience with this, the suite currently doesn’t use any, and the docs suggest that tasks should be “fast”. I’m not sure if a cylc show to determine success of the alternate task, and then cylc remove on the main forecast task is considered fast enough or is bad practice?

Thanks in advance for your help and any guidance on this!

AsifulI · September 17, 2024, 7:12am

I should also note another behaviour that I’ve just encountered, which will trip us up in large workflows. It appears that the Cylc WUI and the command-line interface are linked, and changing the window extent changes the behaviour of the cylc show command.

With n=2 in the Cylc WUI we have the following outputs on the command-line -

>> cylc show ${CYLC_SUITE_NAME}//${CYLC_TASK_CYCLE_POINT}/um_fcst_ss_18  |  grep "state"
state: waiting

With n=1 in the Cylc WUI we have -

>> cylc show ${CYLC_SUITE_NAME}//${CYLC_TASK_CYCLE_POINT}/um_fcst_ss_18  |  grep "state"
No matching active tasks found: 20240915T1800Z/nq_um_fcst_ss_18

When dealing with operational workflows where we likely have the default n=1, neither cylc dump nor cylc show will work. This is at least the case for the first 3 attempts of the main forecast, which will fail in the pre-script simply because the alternate task (with _ss in its name) is not found in the active window.
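One way to stop a pre-script from failing outright in this situation is to treat a missing state line as "unknown" rather than as an error. The following is only a minimal sketch: show_state is a hypothetical helper name, it reads cylc show output on standard input, and it assumes the "No matching active tasks found" message does not arrive on standard output (which is what the grep result above suggests).

```shell
# Hypothetical helper (not part of Cylc): extract the value of the
# "state:" line from `cylc show` output supplied on standard input.
# Prints "unknown" when no state line is present, e.g. because the
# task is outside the scheduler's current window.
show_state() {
    local state
    # Split on ": " and print the value of the first "state:" line only.
    state=$(awk -F': *' '/^[[:space:]]*state:/ {print $2; exit}')
    printf '%s\n' "${state:-unknown}"
}
```

A pre-script could then pipe the command output through the helper, e.g. cylc show "<task id>" 2>/dev/null | show_state, and branch on "unknown" explicitly instead of falling through the case statement.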

We are currently deploying workflows through Gitlab pipelines and the default appears to be n=1, so it would be good to know i) how to redefine this, or better still, ii) avoid this dependence entirely.

Looking forward to your suggestions. Thanks!

oliver.sanders · September 17, 2024, 9:29am

Hi,

My question is directly related to and an extension of the discussion covered in this post.

Unfortunately, this link doesn’t seem to work.

If all goes well, the workflow moves on. However, I need a way to possibly remove the main forecast from the graph entirely.

With Cylc 8, we solve problems like this using “graph branching”.

Here’s a quick example (untested) of how to implement this:

The forecast task will retry up to three times.
If it still fails the forecast_shortstep task will run instead.
The process task will run after either of the forecast or forecast_shortstep tasks succeed.
There is no need to “remove” tasks from the graph manually.

[scheduling]
    [[graph]]
        P1Y = """
            forecast?
            forecast:fail? => forecast_shortstep

            forecast? | forecast_shortstep => process
        """

[runtime]
    [[FORECAST]]
        script = my-model

    [[forecast]]
        inherit = FORECAST
        execution retry delays = 3*PT1M

    [[forecast_shortstep]]
        inherit = FORECAST
        [[[environment]]]
            SHORTSTEP = true

    [[process]]

Note:

We don’t need to put the forecast task into the expired state.
There is no need to use pre-script to hack the task out at runtime.

AsifulI · September 17, 2024, 9:43am

Here is the link to the earlier post Blocking a task being triggered if its recovery task is active - Cylc / Cylc Support - Cylc Workflow Engine

oliver.sanders · September 17, 2024, 9:48am

Rather than loading the entire graph into memory, Cylc generates a “moving window” over your workflow. This is how Cylc is able to handle large graphs and infinite workflows. As a result, the command line and interfaces interact with this “window” and can only see the things inside of it.

The default is always n=1. There is presently no way to change this default; however, you can alter it after the workflow has started, e.g. via the GUI.

Cylc commands like cylc show and cylc dump are intended for monitoring running workflows. Like the GUI, they show you the status of the tasks that the workflow is currently operating on, i.e. the tasks in the scheduler’s window. These commands are not intended for querying historical task states (as the tasks may have drifted outside of the window by the time you make your query).

The cylc workflow-state command looks inside the workflow’s database rather than the scheduler’s “window”. This command is intended mostly for automated applications, e.g. triggering off of tasks in other workflows. This command requires access to the filesystem where the database file is located in order to work.

So, the attempt at removing the main task in its own pre-script didn’t work

Can you please clarify the use of cylc dump and cylc show , and which you’d recommend in such cases?

With Cylc, we should never need to introspect the state of tasks within a workflow to control the graph. Instead, we use graph branching (and custom outputs) to communicate task outcomes directly within the scheduler.

Further reading:

Take a look at the “recovery task” example in the graph branching documentation
If you’re coming from Cylc 7, there’s a page outlining how to migrate Cylc 7 workflows.
More information on the scheduler’s “window” (i.e. n=1, n=2, …).

oliver.sanders · September 17, 2024, 10:13am

Here is the link to the earlier post Blocking a task being triggered if its recovery task is active - Cylc / Cylc Support - Cylc Workflow Engine

Thanks, that’s useful.

If I understand correctly, you are worried that operators might manually trigger the forecast task whilst the forecast_shortstep task is running causing problems?

If so, I would recommend using a single forecast task and using cylc broadcast to toggle the shortstep behavior on and off. This is how we work at the MO.

Here’s an over-simplified example:

[runtime]
    [[forecast]]
        script = my-model
        [[[environment]]]
            SHORTSTEP = false

To turn shortstep on, issue a broadcast e.g:

$ cylc broadcast <workflow> -p <cycle> -n forecast -s '[environment]SHORTSTEP=true'

Any subsequent submission of this task will now run with SHORTSTEP=true.
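How the model consumes SHORTSTEP is not shown in this thread. As a rough sketch only, a wrapper around my-model might validate the variable before using it; select_run_mode and the mode names "shortstep" and "standard" are hypothetical and purely illustrative.

```shell
# Hypothetical wrapper logic; the function and mode names are illustrative
# only and are not part of Cylc or of my-model.
# Prints the run mode implied by SHORTSTEP, defaulting to "standard" when
# the variable is unset, and fails on any value other than true/false.
select_run_mode() {
    case "${SHORTSTEP:-false}" in
        true)  echo "shortstep" ;;
        false) echo "standard" ;;
        *)
            echo "ERROR: SHORTSTEP must be true or false, got '${SHORTSTEP}'" >&2
            return 1
            ;;
    esac
}
```

Failing on unexpected values means a mistyped broadcast (e.g. SHORTSTEP=True) is caught at the start of the job rather than silently running the standard configuration.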

You can automate this using “retry event handlers”. Here’s an example where the task will run with SHORTSTEP=false the first time, then all subsequent attempts will run with SHORTSTEP=true:

[runtime]
    [[forecast]]
        script = my-model
        execution retry delays = 4*PT1M
        [[[events]]]
            retry event handlers = cylc broadcast $CYLC_WORKFLOW_ID -p $CYLC_TASK_CYCLE_POINT -n $CYLC_TASK_NAME -s '[environment]SHORTSTEP=true'
        [[[environment]]]
            SHORTSTEP = false

It’s likely your example isn’t quite as simple as mine above:

There might be multiple environment variables to change.
You might need to extend the execution time limit.
You want to retry three times before falling back to shortstep.

To handle this extra complexity, you might want to move the logic into a short script (e.g. in the workflow’s bin/ directory)

# flow.cylc

[runtime]
    [[forecast]]
        script = my-model
        execution retry delays = 4*PT1M
        execution time limit = PT30M
        [[[events]]]
            retry event handlers = shortstep-handler %(workflow)s %(cycle)s %(task)s %(try_num)d
        [[[environment]]]
            SHORTSTEP = false

#!/usr/bin/env bash
# bin/shortstep-handler

set -euo pipefail

WORKFLOW="$1"
CYCLE="$2"
TASK="$3"
TRY_NUM="$4"

if [[ $TRY_NUM -gt 3 ]]; then
    # the task is on its third retry
    cylc broadcast "${WORKFLOW}" \
        -p "${CYCLE}" \
        -n "${TASK}" \
        -s '[environment]SHORTSTEP=true' \
        -s '[environment]FOO=bar' \
        -s 'execution time limit=PT1H'
fi

For a list of the %(variable)s available to event handlers (e.g. retry event handlers), see task event template variables.
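The threshold logic in a handler like this can be checked without a running scheduler by separating the decision from the broadcast itself. This is a minimal sketch under stated assumptions: should_broadcast and broadcast_command are hypothetical helper names, the default threshold of 3 mirrors the condition in the handler above, and broadcast_command only prints the command rather than running it.

```shell
# Hypothetical helpers for dry-running the handler logic; these names are
# not part of Cylc.

# Succeeds (exit 0) when the try number exceeds the threshold (default 3).
# Returns 2 if the try number is empty or not a non-negative integer.
should_broadcast() {
    local try_num="$1" threshold="${2:-3}"
    case "$try_num" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$try_num" -gt "$threshold" ]
}

# Prints (does not execute) the broadcast command for the given
# workflow, cycle point and task name.
broadcast_command() {
    local workflow="$1" cycle="$2" task="$3"
    printf 'cylc broadcast %s -p %s -n %s -s %s\n' \
        "$workflow" "$cycle" "$task" '[environment]SHORTSTEP=true'
}
```

For example, should_broadcast 4 succeeds while should_broadcast 3 does not, so the printed command can be inspected for each try number before the handler is wired into the workflow.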

hilary.j.oliver · September 19, 2024, 1:27am

That’s a good description of why the moving window on the graph is necessary - some Cylc workflows can be huge, and cycling Cylc workflows can even be infinite in extent.

For info though, we are in the process of extending some task matching functionality beyond the active window. You can already do that to hold and trigger individual future tasks, and more will come.

I would perhaps emphasize the word “should” there as well. There are still some edge cases - e.g. inter-cycle triggering in graphs with several wildly different cycling intervals - where the graph notation is not yet sufficiently powerful and you can (for instance) use an introspective xtrigger (an intrigger?) as a shortcut. As ever with power tools though, be careful not to blow your own foot off!
