Debug a stalling suite that uses expiring tasks

I am porting my old cylc7 suites to cylc8.

Here I am describing an issue I faced when my workflow stalled and never progressed past the initial cycle point.

I’m starting the suite with

extract from rose-suite.conf:

[template variables]
START_CYCLE="20231101T0000Z"
...

extract from flow.cylc:

[scheduling]
    initial cycle point constraints = T00, T12
    initial cycle point = {{ START_CYCLE }}
    runahead limit = P1D
...

So the first cycle point is 20231101T0000Z.

When I ran the suite it never got beyond that initial cycle point (other than the runahead) and eventually stalled. To investigate I revived it with cylc play and ran some checks:

$ cylc dump downloader_suite_cl4 | grep cycle
newest active cycle point=20231103T0000Z
oldest active cycle point=20231101T0000Z

So the oldest active cycle point hadn’t moved; it was stuck. I inspected all tasks at that cycle point with

$ cylc show downloader_suite_cl4 "//20231101T0000Z/*" | grep state
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded

So all tasks that had been instantiated at that cycle point had been completed.

I looked at the status of those tasks in more detail:

$ cylc dump -t downloader_suite_cl4 | grep 20231101T0000Z
crop_gpm_precip, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_med_analysis, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_med_forecast, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_medwam_analysis_00, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_medwam_forecast_00, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_mfwam_analysis_00, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_mfwam_forecast_00, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_psy4_analysis, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_psy4_forecast, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_ecmwf_era5, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_gpm_precip, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
housekeep, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
run_uscc_med_analysis, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
run_uscc_med_forecast, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
run_uscc_psy4_analysis, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
run_uscc_psy4_forecast, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead

Still nothing: everything that needed to run had succeeded.

In my suites the housekeep task is intended to be the last task that runs in each cycle point. So I inspected this task:

$ cylc show downloader_suite_cl4 "//20231101T0000Z/housekeep"
title: (not given)
description: (not given)
URL: (not given)
state: succeeded
prerequisites: (n/a for past tasks)
outputs: ('-': not completed)
  - 20231101T0000Z/housekeep expired
  + 20231101T0000Z/housekeep submitted
  - 20231101T0000Z/housekeep submit-failed
  + 20231101T0000Z/housekeep started
  + 20231101T0000Z/housekeep succeeded
  - 20231101T0000Z/housekeep failed

Again, the housekeep task has succeeded, so the suite should move on.

So I inspected the scheduler logs:

$ grep 'is waiting on' /home/cylc/cylc-run/downloader_suite_cl4/run2/log/scheduler/log | grep 20231101T0000Z
      * 20231101T0000Z/upd_convert_gfs is waiting on ['20231101T0000Z/upd_get_ncep_gfs:succeeded']
      * 20231101T0000Z/upd_convert_gfs is waiting on ['20231101T0000Z/upd_get_ncep_gfs:succeeded']

So that tells me the cycle point is stuck because 20231101T0000Z/upd_convert_gfs is waiting for 20231101T0000Z/upd_get_ncep_gfs.

The upd_get_ncep_gfs task has a clock-trigger of upd_get_ncep_gfs(+PT3H30M) and a clock-expire of upd_get_ncep_gfs(P9D), so it should have expired by now. Why is upd_convert_gfs still waiting for it?
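In flow.cylc terms those settings look roughly like this (a sketch; the values are as quoted above):

[scheduling]
    [[special tasks]]
        clock-trigger = upd_get_ncep_gfs(+PT3H30M)
        clock-expire = upd_get_ncep_gfs(P9D)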

The relevant section of the graph looks like this:

        [[[T00]]]
            graph = """
                upd_get_ncep_gfs? => upd_convert_gfs
                (sst_get_ncep_gfs[+PT18H] | sst_get_ncep_gfs[+PT18H]:expired) => upd_convert_gfs

                (upd_get_ncep_gfs:expired | upd_convert_gfs) => housekeep
            """

The intention is for upd_convert_gfs to use the outputs of sst_get_ncep_gfs and upd_get_ncep_gfs, where the SST data is optional because it expires sooner. The convert task may run with GFS+SST data or with GFS data alone.
But upd_convert_gfs should definitely not run if neither task could have completed, so it should never be instantiated if both have expired.

And I think this is where my problem was - something had caused the upd_convert_gfs task to be instantiated, but now it was sitting there holding everything up.

I then realised that the problem must be in this line:

(sst_get_ncep_gfs[+PT18H] | sst_get_ncep_gfs[+PT18H]:expired) => upd_convert_gfs

so I changed the graph to this:

        [[[T00]]]
            graph = """
                upd_get_ncep_gfs? => upd_convert_gfs
                (sst_get_ncep_gfs[+PT18H]? | sst_get_ncep_gfs[+PT18H]:expired?) => upd_convert_gfs

                (upd_get_ncep_gfs:expired | upd_convert_gfs) => housekeep
            """

I added the question mark to all tasks that could spawn an upd_convert_gfs task.

And now the suite runs without stalling.

I hope this helps someone.

I also have two questions for the support team:

  • Why did my command lines using cylc show ... | grep state not show the dangling upd_convert_gfs task? They showed that all tasks at that cycle point had completed, but the upd_convert_gfs task that was holding things up was never shown as waiting. Why?

  • After running the suite for a while longer the workflow advanced beyond the runahead limit of P1D, but cylc dump still showed oldest active cycle point=20231101T0000Z. Why? Shouldn’t that cycle point be garbage collected at some point? The workflow is already at newest active cycle point=20231105T0000Z, but the oldest cycle point is still shown. How does that get cleaned up?

Thanks

Hi @fredw - I will reclassify this post to the Cylc 8 migration topic. It’s a bit too complicated to be a Tip. Maybe we can distill a tip out of it in the end though.

It looks like you’re getting pretty good at workflow failure analysis :tada:

In the current (and previous) Cylc 8 releases, tasks in the active window with partially satisfied dependencies are kind of hidden from users. Our original thinking for the new scheduling algorithm was that tasks with multiple prerequisites should be spawned into the active window only once fully satisfied (like single prerequisite tasks are) and that the scheduler should manage partially satisfied tasks internally purely as a way to keep track of prerequisites.

However:

  • we’ve realized that was a mistake, and Cylc 8.3.0 will not do that. In the upcoming release, cylc dump will show the upd_convert_gfs task in your case.
  • in the current release, even these “hidden” partially satisfied tasks do get revealed in the scheduler log if they result in a stall, which is what you’ve found

On your second question: the oldest active cycle point is still 20231101T0000Z because of the aforementioned partially satisfied task. It will eventually cause your workflow to stall again (when the front advances to the runahead limit).

Partially satisfied tasks always indicate that something is wrong, so they do not get cleaned up. What does get cleaned up is completed tasks - i.e., tasks that completed all of their required outputs and have thus fulfilled their function in the workflow.

So I think you still have a problem in your graph configuration.

A couple of thoughts, from looking at your graph snippets above:

sst_get_ncep_gfs[+PT18H] => stuff

That is a future trigger. It means stuff at cycle point t triggers off of sst_get_ncep_gfs at the future cycle point t + PT18H. Is that really your intention? There are use cases for future triggers, but tasks normally trigger off of the previous cycle, e.g. sst_get_ncep_gfs[-PT18H].
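Side by side (a sketch, reusing your task names):

sst_get_ncep_gfs[+PT18H] => stuff  # future trigger: stuff at t waits on the cycle at t + PT18H
sst_get_ncep_gfs[-PT18H] => stuff  # previous-cycle trigger: stuff at t waits on the cycle at t - PT18H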

I added the question mark to all tasks that could spawn an upd_convert_gfs task.

Make sure you understand what the question mark means:

foo => bar  # short for foo:succeed => bar

This means:

  • bar should trigger if foo succeeds
  • and foo’s success is required: if it does not succeed it will be retained in the active window as an incomplete task that will stall the workflow unless dealt with
foo? => bar  # short for foo:succeed? => bar

This means:

  • bar should trigger if foo succeeds (i.e., exactly the same as above!)
  • but foo’s success is optional: it can be cleaned up as complete whether it succeeds or fails (in effect we are saying we expect it to fail sometimes, and the graph handles that eventuality automatically)

Note the :expired output should always be marked optional, i.e. :expired? (required expiry doesn’t make sense). This will be enforced at 8.3.0.
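For example (a minimal sketch with generic task names):

foo:expired? => bar   # OK: expiry marked optional
foo:expired => bar    # will be a validation error at 8.3.0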

A note on partially satisfied tasks:

foo => qux
bar => qux

(or foo & bar => qux)

For qux to trigger both foo and bar must succeed. If foo succeeds and bar fails, then obviously qux will remain partially satisfied. When the workflow stalls as a result, the scheduler will log both of these reasons:

  • foo failed and is incomplete (its success was required)
  • qux is partially satisfied

That’s the straightforward case: foo failed when it wasn’t supposed to, and both resulting problems are logged.

But your problem might be a bit less obvious. Consider this:

foo? => qux
bar => qux

Now if foo fails and bar succeeds, the scheduler can clean up foo because we said its success is optional. So there is no longer a failed task to stall the workflow. BUT qux is still partially satisfied, and that will stall the workflow.

This graph is probably a mistake. It says qux has two prerequisites, but one is expected to be satisfied and the other may not be.

Also, something could go wrong upstream that prevents qux being satisfied:

foo => qux
baz? => bar => qux

Now if foo succeeds and baz fails (and gets cleaned up because we said its success is optional) we end up with a partially satisfied qux that is harder to diagnose, because the other branch leading to it got cut off higher up the graph.

Optional outputs (with "?") should be used mainly for optional graph branching, where the flow should go one way or another at run time. Otherwise, be careful not to write a graph that dead-ends on the way to a required prerequisite.
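For example, a minimal branching sketch (generic task names):

foo? => apple        # take this branch if foo succeeds
foo:fail? => orange  # take this branch if foo fails

Both eventualities are handled, so the flow goes one way or the other without stalling.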

(Sorry for the verbosity of this post - it’s hard to be concise when writing quickly, and I’m trying to cover the concepts that might lead you to find the problem. If you want to send the full workflow config as a direct message to me, I might be able to see it quickly - however, note that I’m going to be on leave Monday - Friday this week).

This is exactly what’s happening in my graph, and I have no idea how to configure the new workflow engine to replicate the old suicide-trigger behaviour.

This is a part of my workflow that causes (one of the) problems:

        [[[T00]]]
            graph = """
                upd_get_ncep_gfs? => upd_convert_gfs
                (sst_get_ncep_gfs[+PT18H] | sst_get_ncep_gfs[+PT18H]:expired?) => upd_convert_gfs

                (upd_get_ncep_gfs:expired? | upd_convert_gfs) => housekeep
            """

The behaviour I want to achieve in cylc8 is what this used to do in cylc7.9:

        [[[T00]]]
            graph = """
                upd_get_ncep_gfs & (sst_get_ncep_gfs[+PT18H] | sst_get_ncep_gfs[+PT18H]:expired) => upd_convert_gfs
                upd_get_ncep_gfs:succeed => ! upd_expire_convert_gfs
                upd_get_ncep_gfs:expired => upd_expire_convert_gfs
                upd_get_ncep_gfs:expired | upd_convert_gfs | upd_convert_gfs:expired => housekeep
            """

This construct has worked reliably and robustly for years. If presented unchanged to cylc8 it gives this error:

GraphParseError: Output upd_get_ncep_gfs:expired can't be both required and optional

This matches the explanation given by Hilary above, but I don’t know how to resolve this.

The desired behaviour:
The task upd_get_ncep_gfs downloads gfs data (but it expires after 7 days) and sst_get_ncep_gfs downloads the SST data (but this expires after 2 days), and the SST data is desired, but not required. So if the cycle point is encountered within the last 2 days the upd_convert_gfs will create a file that contains both the GFS meteo data and the SST data, this is the best outcome.

But if the cycle point is, e.g., 5 days in the past, then only the GFS data will be available (the SST data has expired) and upd_convert_gfs should run with GFS data only.

If the cycle point is more than 7 days in the past, the upd_convert_gfs task should be expired. That is done by the upd_expire_convert_gfs task, which executes

cylc reset "$CYLC_SUITE_NAME" "$CYLC_TASK_CYCLE_POINT/upd_convert_gfs" -s "expired"

something that used to work fine in cylc7, but doesn’t seem to have an equivalent in cylc8.

Obviously, if upd_get_ncep_gfs has succeeded, the upd_expire_convert_gfs task never needs to be instantiated, and the cylc7 suite just suicides it using the ! operator.

This is the kind of logic I have everywhere in my workflows, and I cannot find a suitable construct that will work in cylc8. The question mark operator doesn’t seem to do what the suicide operator used to do, and even with my patched version that understands the :expired trigger my workflows don’t run as expected.

With this old vs new example, maybe someone can suggest how to get this working in cylc8.

Thank you very much everyone!

Fred

Actually, presented unchanged it doesn’t give that error - you must already have added a “?” somewhere to get that.

First thing to check: with the 8.2.3 patch to get expire triggers working, Cylc 8 should be able to run the original config (still named suite.rc) in back-compat mode. Suicide triggers are still supported, mainly for backward compatibility. Have you tried this, and if so, does it run as expected?
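For example (a sketch; back-compat mode is triggered simply by the config file still being named suite.rc, and the paths here are placeholders):

$ cylc validate /path/to/suite-source
$ cylc install /path/to/suite-source
$ cylc play <workflow-id>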

The question mark operator doens’t seem to do what the suicide operator used to do

That’s true. The question mark (“optional output” marker), along with the new event-driven scheduler, make suicide triggers unnecessary.

# Cylc 7
foo => good1 => good2
foo:fail => bad1 => bad2
foo:fail => !good1
foo => ! bad1

Here, there are alternate success and fail branches. The Cylc 7 scheduler pre-spawned downstream tasks into the “active window” before they were needed, which means both the good1 and bad1 tasks existed in the scheduler’s awareness before the outcome of foo was known. So the suicide triggers were needed to say, “if the good branch is needed, remove the bad branch”, and vice versa.

Here’s the Cylc 8 version:

# Cylc 8
foo? => good1 => good2
foo:fail? => bad1 => bad2

The question marks just indicate that foo:succeed is not required by the scheduler - we expect foo to fail sometimes, and the graph handles both eventualities. In Cylc 8 downstream tasks are not pre-spawned, so there is nothing to remove. If foo succeeds, good1 gets spawned. If foo fails, bad1 gets spawned.

I’ll take a look at your graph shortly …

You will be able to manually expire a task in 8.3.0. In the meantime, you can prevent the task from running (if it is already in the active window) with cylc remove.

[UPDATE] actually, can’t you just have upd_convert_gfs clock-expire at 7 days too? (then no forced manual expire is needed).
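I.e., a sketch of the clock-expire settings (same values as in the full example below):

[scheduling]
   [[special tasks]]
      clock-expire = upd_get_ncep_gfs(P7D), upd_convert_gfs(P7D), sst_get_ncep_gfs(P2D)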

@fredw - I think you have hit a bug. My patch to back-port 8.3 expire handling probably wasn’t quite sufficient for parentless tasks (i.e., those with no upstream tasks to spawn them). Sorry about this! You’re clearly ahead of others in terms of moving critical use of :expired to Cylc 8, but this will be important to many of us soon - I’ll try to come back with a solution ASAP.

OK @fredw,

Here’s an easy workaround that I hope will see you through to the 8.3.0 release, tested with your graph fragment above:

  • add an additional single-line code change to your patched 8.2.3 release
  • add dummy tasks to your graph to trigger the expiring tasks (so they are no longer parentless)

the code patch

$ git diff
diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index 8f27d7d9d..887ffa6e6 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -1798,7 +1798,7 @@ class TaskPool:

     def set_expired_tasks(self):
         res = False
-        for itask in self.get_tasks():
+        for itask in self.get_all_tasks():
             if self._set_expired_task(itask):
                 res = True
         return res

the dummy tasks

Here’s a runnable Cylc 8 translation of your Cylc 7 graph above, with parentless dummy tasks d1, d2, d3 that trigger your (formerly parentless) expiring tasks.

[scheduling]
   initial cycle point = 20231115  # ~8 days ago
   runahead limit = P1  # (easier to monitor progress)
   [[special tasks]]
      clock-expire = upd_get_ncep_gfs(P7D), upd_convert_gfs(P7D), sst_get_ncep_gfs(P2D)
   [[graph]]
      T00 = """
         d1 => upd_get_ncep_gfs
         d2 => upd_convert_gfs
         upd_get_ncep_gfs & (sst_get_ncep_gfs[PT18H]? | sst_get_ncep_gfs[PT18H]:expired?) => upd_convert_gfs
         upd_get_ncep_gfs:expired? | upd_convert_gfs | upd_convert_gfs:expired? => housekeep
      """
      T18 = """
         d3 => sst_get_ncep_gfs
         sst_get_ncep_gfs
      """
[runtime]
   [[root]]
      script = sleep 5
   [[d1, d2, d3]]
      # dummy tasks, to trigger the expiring tasks
   [[upd_get_ncep_gfs]]
   [[sst_get_ncep_gfs]]
   [[upd_convert_gfs]]
   [[housekeep]]

Note:

  • I’ve removed the upd_expire_convert_gfs task. As far as I can see, you don’t need it. Just clock-expire upd_convert_gfs instead of having a task that force-expires it via the CLI.
  • the initial point should be more than 7 days in the past, relative to real time, to check both expiry scenarios

I ran this with cylc vip, and monitored progress with cylc tui (and/or cylc vip --no-detach --no-timestamps to follow progress in the terminal).

The housekeep task runs in every cycle:

$ cylc log frr | sed -e 's/^.* - //' | grep -E '/housekeep .* => (submitted|expired)'
[20231115T0000Z/housekeep preparing job:01 flows:1] => submitted
[20231116T0000Z/housekeep preparing job:01 flows:1] => submitted
[20231117T0000Z/housekeep preparing job:01 flows:1] => submitted
[20231118T0000Z/housekeep preparing job:01 flows:1] => submitted
[20231119T0000Z/housekeep preparing job:01 flows:1] => submitted
[20231120T0000Z/housekeep preparing job:01 flows:1] => submitted
[20231121T0000Z/housekeep preparing job:01 flows:1] => submitted
[20231122T0000Z/housekeep preparing job:01 flows:1] => submitted
[20231123T0000Z/housekeep preparing job:01 flows:1] => submitted
[20231124T0000Z/housekeep preparing job:01 flows:1] => submitted
[20231125T0000Z/housekeep preparing job:01 flows:1] => submitted

The upd_get_ncep_gfs and upd_convert_gfs tasks expire in the first two cycles, then run in subsequent cycles:

$ cylc log frr | sed -e 's/^.* - //' | grep -E '/upd_get_ncep_gfs .* => (submitted|expired)'
[20231115T0000Z/upd_get_ncep_gfs waiting(queued) job:00 flows:1] => expired(queued)
[20231116T0000Z/upd_get_ncep_gfs waiting(queued) job:00 flows:1] => expired(queued)
[20231117T0000Z/upd_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231118T0000Z/upd_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231119T0000Z/upd_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231120T0000Z/upd_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231121T0000Z/upd_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231122T0000Z/upd_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231123T0000Z/upd_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231124T0000Z/upd_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231125T0000Z/upd_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231126T0000Z/upd_get_ncep_gfs preparing job:01 flows:1] => submitted
$ cylc log frr | sed -e 's/^.* - //' | grep -E '/upd_convert_gfs .* => (submitted|expired)'
[20231115T0000Z/upd_convert_gfs waiting job:00 flows:1] => expired
[20231116T0000Z/upd_convert_gfs waiting job:00 flows:1] => expired
[20231117T0000Z/upd_convert_gfs preparing job:01 flows:1] => submitted
[20231118T0000Z/upd_convert_gfs preparing job:01 flows:1] => submitted
[20231119T0000Z/upd_convert_gfs preparing job:01 flows:1] => submitted
[20231120T0000Z/upd_convert_gfs preparing job:01 flows:1] => submitted
[20231121T0000Z/upd_convert_gfs preparing job:01 flows:1] => submitted
[20231122T0000Z/upd_convert_gfs preparing job:01 flows:1] => submitted
[20231123T0000Z/upd_convert_gfs preparing job:01 flows:1] => submitted
[20231124T0000Z/upd_convert_gfs preparing job:01 flows:1] => submitted
[20231125T0000Z/upd_convert_gfs preparing job:01 flows:1] => submitted

And sst_get_ncep_gfs expires in the first 6 cycles, then starts running:

$ cylc log frr | sed -e 's/^.* - //' | grep -E '/sst_get_ncep_gfs .* => (submitted|expired)'
[20231115T1800Z/sst_get_ncep_gfs waiting(queued) job:00 flows:1] => expired(queued)
[20231116T1800Z/sst_get_ncep_gfs waiting(queued) job:00 flows:1] => expired(queued)
[20231117T1800Z/sst_get_ncep_gfs waiting(queued) job:00 flows:1] => expired(queued)
[20231118T1800Z/sst_get_ncep_gfs waiting(queued) job:00 flows:1] => expired(queued)
[20231119T1800Z/sst_get_ncep_gfs waiting(queued) job:00 flows:1] => expired(queued)
[20231120T1800Z/sst_get_ncep_gfs waiting(queued) job:00 flows:1] => expired(queued)
[20231121T1800Z/sst_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231122T1800Z/sst_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231123T1800Z/sst_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231124T1800Z/sst_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231125T1800Z/sst_get_ncep_gfs preparing job:01 flows:1] => submitted
[20231126T1800Z/sst_get_ncep_gfs preparing job:01 flows:1] => submitted