Workflow consistently stalling

jonnyhtw · April 14, 2023, 2:28am

hey,

i have a workflow which is running fine but it keeps on stalling with messages like…

> cylc log u-cv956|grep -i 'is waiting on'
      * 20090701T0000Z/housekeeping is waiting on
      ['20090701T0000Z/postproc:succeeded']

this is unexpected since this task has definitely completed successfully and so i am having to use cylc set-outputs to fix these stalls. i have messed around with this workflow a lot and i am a cylc 8 noob, but i was just wondering if anyone knows of a reason why this might be happening and if so how to fix it?

i have an identical (in cylc terms anyway) workflow which is running fine so i know it’s something that i must have broken manually without meaning to!

btw i have manually run cylc trigger --flow=new ... a couple of times to force a housekeeping task to rerun so that i could test a postprocessing workflow which is run as part of a post-script. this may well have messed things up.

cheers,

jonny

hilary.j.oliver · April 14, 2023, 4:13am

Hi @jonnyhtw

The fact that the identical workflow is running fine shows there’s nothing fundamentally wrong, which is good.

this is unexpected since this task has definitely completed successfully and so i am having to use cylc set-outputs to fix these stalls. … btw i have manually run cylc trigger --flow=new ... a couple of times

I suspect you have inadvertently started a new flow with a new flow number (i.e. a new self-perpetuating run through the graph), and the apparently-unsatisfied prerequisites pertain to the other flow.

Can you stop your workflow, then restart it with cylc play --debug so we can see flow numbers in the log? You can show me the results offline, and I’ll post an update here later once we’ve confirmed what’s happening.

hilary.j.oliver · April 14, 2023, 4:30am

Actually, flow numbers should still be reported without debug mode on. Do you see this sort of thing:

WARNING - [20200101T0000Z/foo failed job:01 flows:1] did not complete required outputs:
    ['succeeded']
INFO - [20200101T1200Z/foo running job:01 flows:1] => failed

But with flows other than 1?

jonnyhtw · April 18, 2023, 12:01am

hi @hilary.j.oliver, thanks for this and the offline chat, moving back here in case this is helpful for others in the future…

seems that doing lots of this - - - > cylc trigger --flow=new... launched lots of new flows which were then left hanging/orphaned (?).

i then stopped them with cylc stop --flow=[number]... and this seems to have allowed the workflow to proceed as expected.

so, the question now is, how would i run a task in an arbitrary cycle point, which may well be significantly behind the runahead limit and not trigger any tasks further down the flow?

again this may seem odd in cylc 8 world but it is something that we need to do often-ish in climate modelling. an example might be to re-trigger a specific postproc task to generate a missing annual mean which wasn’t created for whatever reason.

cheers!

jonny

hilary.j.oliver · April 18, 2023, 12:58am

Yeah, I think what happened was you repeatedly launched new flows in parts of the graph with “off-flow” prerequisites, which caused a stall. To illustrate what I mean by that:

[Image: “off-flow” graph, in which B triggers both C and D, which together trigger E, which triggers F]

In this graph, if I trigger a new flow at 1/B, it will flow on through the whole graph downstream of 1/B, because success of B satisfies the prerequisites of both C and D, which satisfy E, which satisfies F.

But if I trigger a new flow at 1/C the workflow will stall with 1/E waiting on the “off-flow” task 1/D (which I did not trigger in the new flow, and which does not get automatically triggered by a parent task in the new flow).

No biggie, I can still use cylc set-outputs to make the workflow carry on as if 1/D had succeeded, or I could manually trigger 1/E to make it run even though its prerequisites are not satisfied.
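
As a rough sketch, the graph described above could be written like this in a flow.cylc. It is reconstructed from the description rather than copied from the image, and the integer cycling setup and the trivial runtime section are assumptions added only to make it a complete example:

```cylc
# Hypothetical reconstruction of the "off-flow" example graph.
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[graph]]
        R1 = """
            B => C & D   # success of B satisfies both C and D
            C & D => E   # E needs both C and D to succeed
            E => F
        """
[runtime]
    # Placeholder jobs (assumption); real tasks would do actual work.
    [[B, C, D, E, F]]
        script = true
```

With this graph, a new flow triggered at 1/C leaves 1/E waiting on 1/D:succeeded, because nothing in the new flow runs 1/D.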

Unfortunately, in your workflow there were implicit (not visible in the graph) off-flow prerequisites due to declaring all tasks to be sequential (an implicit dependence on their own previous-cycle instance). This made it harder to see what the problem was. [Note we’ve recommended not using “sequential tasks” for a long time now].

Excellent question.

The ability to run multiple logically distinct flows through the same graph is a super-powerful new feature of Cylc 8.

If you manually trigger a task somewhere in the graph:

- by default it belongs to the current active flow(s) - which is usually flow=1
- but you can make it start a new flow (--flow=new)
- or you can make it not belong to any flow (--flow=none)

With --flow=none the triggered task will run without starting a flow or affecting any other flows that may pass over it in the future.

Otherwise the triggered task will flow on IF its child-tasks (i.e. downstream in the graph) have not already run in this flow.

To re-run a part of the past graph, you have to start a new flow there, because the tasks downstream of the triggered one have already run in the original flow.

You can read all about flows here: Concurrent Flows — Cylc 8.2.2 documentation

jonnyhtw · April 18, 2023, 4:21am

this is great, thanks @hilary.j.oliver!

i’ve tested the --flow=none syntax and it works perfectly! thanks for the background info on this.

… we’ve recommended not using “sequential tasks” for a long time now.

does sequential mean something cylc-specific here? in climate modelling, the main number crunching task in cycle point N will always need to wait for that at N-1 to finish first since the starting state of N will be the same as the end state of N-1.

cheers,

hilary.j.oliver · April 18, 2023, 4:39am

By “sequential tasks” I mean this:

# (A)
[scheduling]
   [[special tasks]]
      sequential = foo, bar  # <----  implicit previous-cycle dependence!
   [[graph]]
      P1D = """
         foo => bar
      """

That is exactly equivalent to this:

# (B)
[scheduling]
   [[graph]]
      P1D = """
         foo => bar 
         foo[-P1D] => foo  # explicit, good!
         bar[-P1D] => bar
      """

But (B) is much better because the dependencies are explicit - you can see them in the graph, so you’re not going to forget about them and wonder why a task is waiting on a seemingly-invisible prerequisite. And you’re less likely to do it unnecessarily, again because all the dependencies are visible in the graph (unfortunately I’ve seen a lot of workflows where users seem to unnecessarily declare all the tasks as sequential “just in case”).

hilary.j.oliver · April 18, 2023, 4:41am

Yes, dead right - but my point is, it’s much better to make that dependence explicit in the graph.
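
For the climate-modelling case, a minimal sketch of the explicit form might look like the following. The task name model is hypothetical, and which tasks really need a previous-cycle dependence is an assumption that depends on the workflow:

```cylc
[scheduling]
    [[graph]]
        P1D = """
            model[-P1D] => model   # cycle N starts from the end state of cycle N-1
            model => postproc => housekeeping   # no previous-cycle dependence assumed here
        """
```

Here only model waits on its own previous-cycle instance, and that dependence is visible in the graph rather than hidden in a “sequential” declaration.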

jonnyhtw · April 18, 2023, 5:23am

OK got it, agreed!

fwiw i didn’t realise that ‘explicit’ meant ‘visible in the graph’, which has undoubtedly caused me to be even-slower-than-usual on the uptake here.

cheers,

jonny

jonnyhtw · April 18, 2023, 5:34am

brilliant thanks,

i’ll have a go at changing the graph to incorporate this! i’ll launch it as a different workflow to avoid any further issues haha.

cheers,

j
