Tasks not clearing

puskar49 · December 14, 2023, 10:52pm

Hello again. I’m sure I’m missing something obvious here, but we’ve got a task that won’t clear from the GUI, and I have confirmed that it is causing the workflow to stall. In this case, the two sniffer tasks fail intentionally until the files are available. The dumpfile sniffer is clearing in this example, though there are two differences. First, the dumpfile sniffer succeeds on its first try, since these cycles are back in time and the start files are already present. Second, the dumpfile sniffer doesn’t send any cylc messages until the final “ready”, while the fieldsfile sniffer sends a message for each forecast hour. Either way, everything is working exactly as expected except that the fieldsfile sniffer isn’t clearing, which stalls the workflow.

Here is the graph:

[[graph]]
    T00, T06, T12, T18 = """
        @start & purge[-PT6H]:finish => prune & purge
        @start => dumpfile_sniffer:ready => unpack => recon => forecast?
        forecast:start => fieldsfile_sniffer:ready<fcsthr> => process<fcsthr> => trim<fcsthr>
        process<fcsthr> => tracker => finish

        forecast:fail? => forecast_ss
        dumpfile_sniffer:delayed? => send_dumpfile_alert?
        fieldsfile_sniffer:delayed? => send_fieldsfile_alert?
    """

Here’s the GUI:

hilary.j.oliver · December 15, 2023, 4:48am

Does the workflow literally stall, as in report in the scheduler log that it has stalled, or does it just hang as if still waiting on something?

If it stalls, the log should list the partially satisfied or incomplete tasks that caused the stall.

Otherwise, does the offending sniffer task actually exit - either with success or failure - and is that exit status picked up and logged by the scheduler?

hilary.j.oliver · December 15, 2023, 5:02am

I don’t think you’ve given enough detail for me to see what’s going wrong. One idea, just in case:

...
        forecast:start => fieldsfile_sniffer:ready<fcsthr>  # no '?'
...

All of your individual :ready<fcsthr> outputs are required, not optional. So if the task finishes without completing every single one of them, it will be retained in the active window as an incomplete task, and that will eventually stall the workflow at the runahead limit. However, that would be logged as the stall reason, as I mentioned above.

puskar49 · December 15, 2023, 1:28pm

Yes, it literally stalls.

The scheduler log states that none of the ready_fcsthrXXX outputs completed.

The fieldsfile_sniffer job does finish and is marked as completed, yet the scheduler log reports the ready_fcsthrXXX outputs as not completed. Those outputs were sent by the fieldsfile_sniffer job, and they correctly triggered the process jobs, so it’s not clear exactly what is happening here.

puskar49 · December 15, 2023, 1:30pm

Correct, and they should be required. These are message triggers. All of the messages are being sent and the appropriate post process jobs are being triggered. So I’m not sure what constitutes “finishing” a cylc task message trigger here.

puskar49 · December 15, 2023, 1:32pm

Also note that the dumpfile sniffer uses the same mechanism, except that it uses a single message trigger rather than parameterization for multiple message triggers. In that case, the task is being marked as completed and it is clearing.

oliver.sanders · December 15, 2023, 4:07pm

Dammit, there’s a bug!

A task which completes multiple required outputs over multiple submissions will currently not get marked as complete when it finishes. I’m very surprised that no one has hit this before!

github.com/cylc/cylc-flow

optional outputs: tasks which complete outputs on multiple tries marked as incomplete

opened 03:27PM - 15 Dec 23 UTC

oliver-sanders

bug

A nasty bug where a task which generates multiple required outputs, but over multiple submissions, is erroneously marked as incomplete.

```ini
[scheduling]
    [[graph]]
        R1 = """
            a:1 & a:2 & a:3 => z
            a?
        """

[runtime]
    [[a]]
        script = """
            cylc message -- $CYLC_TASK_TRY_NUMBER
            false
        """
        execution retry delays = 2*PT0S
        [[[outputs]]]
            1 = 1
            2 = 2
            3 = 3
    [[z]]
```

Selected log entries:

```
INFO - [1/a running job:01 flows:1] (received)1 at 2023-12-15T15:20:24Z
WARNING - [1/a waiting job:01 flows:1] retrying in P0Y (after 2023-12-15T15:20:29Z)
INFO - [1/a running job:02 flows:1] (received)2 at 2023-12-15T15:20:34Z
WARNING - [1/a waiting job:02 flows:1] retrying in P0Y (after 2023-12-15T15:20:37Z)
INFO - [1/a running job:03 flows:1] (received)3 at 2023-12-15T15:20:42Z
INFO - [1/z waiting(queued) job:00 flows:1] => waiting
INFO - [1/z waiting job:01 flows:1] => preparing
INFO - [1/a running job:03 flows:1] => failed
WARNING - [1/a failed job:03 flows:1] did not complete required outputs: ['1', '2']
INFO - [1/z running job:01 flows:1] => succeeded
```

So all three custom outputs are received, and the downstream task does run, but the upstream task remains (erroneously) marked as incomplete.

The best workaround I can come up with is to write your outputs to a file and message the whole lot back at the end of the task script, here’s an example:

[scheduling]
    [[graph]]
        R1 = """
            a:1 & a:2 & a:3 => z
            a?
        """

[runtime]
    [[a]]
        script = """
            # add the output to the outputs file
            touch outputs
            echo "$CYLC_TASK_TRY_NUMBER" >> outputs

            # send all outputs back to Cylc
            while IFS= read -r output; do
                cylc message -- "$output"
            done < outputs

            # fail (retries configured)
            false
        """
        execution retry delays = 2*PT0S
        [[[outputs]]]
            1 = 1
            2 = 2
            3 = 3
    [[z]]

puskar49 · December 15, 2023, 5:03pm

oliver.sanders:

> Dammit, there’s a bug!
>
> A task which completes multiple required outputs over multiple submissions will currently not get marked as complete when it finishes. I’m very surprised that no one has hit this before!

Thanks for the heads up! I think for now we can just add the optional marker to the output messages to let the graph clear. That doesn’t seem to be causing us any problems, so I think it’s just a matter of having the “correct” graph once the bug has been addressed.
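
For reference, a minimal sketch of what that interim change might look like in the graph from the first post, assuming the “?” marker is simply appended to the parameterized output name. This has not been tested against the real workflow, and only the affected line is shown:

```ini
[[graph]]
    T00, T06, T12, T18 = """
        # Interim workaround (assumption: '?' follows the parameterized
        # output name). Each ready<fcsthr> output becomes optional, so the
        # sniffer should no longer be retained as incomplete when its
        # outputs arrive across several retries.
        forecast:start => fieldsfile_sniffer:ready<fcsthr>? => process<fcsthr> => trim<fcsthr>
    """
```

The trade-off is that the scheduler would no longer flag a sniffer that finishes without sending every message, so the “?” markers would presumably come out again once a release containing the fix is in use.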

I once again appreciate the rapid responses. Cheers.

hilary.j.oliver · December 15, 2023, 8:39pm

Good spotting @oliver.sanders - I missed the “completed multiple outputs over multiple retries” bit; that’s definitely the problem.

The good news is that it already works on the “cylc set” development branch for 8.3.0.

A bit of background for @puskar49 -

Given that optional outputs can be completed in this way in the current release, I think the required output behaviour (in the current release) was deliberate. We originally figured that all required outputs must be completed in a single execution of the job. That’s the natural interpretation of “required output” for a model run, say. The reason that is wrong, in the context of the workflow, is a bit subtle: downstream tasks get triggered individually as each output is completed, so in terms of the workflow it doesn’t matter if it takes multiple job retries to do that.
