Cylc broadcast error: WorkflowStopped

Good afternoon,

I am seeking advice for debugging a broadcast.

In particular, I have a task that, after computing a variable, tries to send it to another task.

Let’s say that this is my task (named hindcast_recup_SST)

#!/bin/bash
set -euo pipefail 
# Count the number of files, that will equal the number of tasks needed to create superobs
n_task=$(ls -lrt ${MY_PATH} | wc -l) 
# Send the number of tasks necessary to create superobs task
cylc broadcast ${CYLC_WORKFLOW_ID}  -p ${CYLC_TASK_CYCLE_POINT} \
                                    -n  hindcast_create_superobs_SST \
                                    -s "[directives]--ntasks=${n_task}" 

hindcast_create_superobs_SST is the task where I need to set the ntasks directive.

Now the problem: the first task fails with the following error

WorkflowStopped: glo12_cylc/run1 is not running
2025-05-13T12:48:24Z CRITICAL - failed/ERR

And the error is caused by the broadcast, since if I remove it, the task succeeds.
I tried to run the broadcast by hand on the command line, and it works

$ cylc broadcast glo12_cylc/run1 -p 20241111 -n hindcast_create_superobs_SST -s "[directives]--ntasks=4"
Broadcast set:

  • [20241111/hindcast_create_superobs_SST] [directives]--ntasks=4

Now, I have run many tests, and I am quite at a loss.

First of all, how does cylc check if a workflow is running? Normally I check my workflow with cylc tui, where it is marked as running, but from the error message I understand there must be some variable that is telling cylc the opposite.

Second: how do I debug this? I tried to run the workflow in debug mode, but I only see an error like

DEBUG - [20241111/hindcast_recup_SST/01:submitted] (polled)failed
INFO - [20241111/hindcast_recup_SST/01:submitted] setting implied output: started
DEBUG - [20241111/hindcast_recup_SST/01:submitted] (internal)started
INFO - [20241111/hindcast_recup_SST/01:submitted] => running
DEBUG - [20241111/hindcast_recup_SST/01:running] health: execution timeout=None, polling intervals=PT1M,3*PT2M,PT1M,…
INFO - [20241111/hindcast_recup_SST/01:running] => failed
WARNING - [20241111/hindcast_recup_SST/01:failed] did not complete the required outputs:
⨯ ┆ succeeded

I don’t know how to retrieve more information on the failure. I added --debug to the broadcast command without much success. Also, I have previous broadcasts in the workflow, and they work. Could it be that some previous broadcast fails, and this one is just trapping the error?

Any help will be highly appreciated!

Best,
Stella

First of all, how does cylc check if a workflow is running?

Cylc workflows create a special file when they are started containing information about where the scheduler is running. You can find this file in ~/cylc-run/<workflow-id>/.service/contact. The file is removed when the scheduler shuts down.

All Cylc utilities, from cylc broadcast to cylc gui, use this file to determine if the workflow is running.
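
As a rough way to see what Cylc sees, you can check for the contact file yourself on the host where the command fails. The sketch below only tests that the file exists at the path described above; the `contact_state` helper and its optional run-root argument are hypothetical names for illustration, and the real Cylc client may do more than this.

```shell
#!/bin/bash
# contact_state WORKFLOW_ID [RUN_ROOT]
# Prints "present" if <RUN_ROOT>/<WORKFLOW_ID>/.service/contact exists,
# otherwise "absent". RUN_ROOT defaults to ~/cylc-run.
contact_state() {
    local workflow_id="$1"
    local run_root="${2:-$HOME/cylc-run}"
    local contact="${run_root}/${workflow_id}/.service/contact"
    if [[ -f "$contact" ]]; then
        echo "present"
    else
        echo "absent"
    fi
}
```

For example, calling `contact_state glo12_cylc/run1` from a job script on the remote platform shows whether that host can see a contact file for the workflow.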


Second: how do I debug this?

Cylc needs to keep track of the status of the jobs it submits. There are two options for this:

  • Push - Cylc jobs send messages back to the scheduler.
  • Pull - The scheduler routinely checks the job.status file that each job creates.

Looking at this log snippet:

DEBUG - [20241111/hindcast_recup_SST/01:submitted] (polled)failed

The (polled) bit means that Cylc is using “pull” based communication (aka “polling”) for this job.

Unfortunately, that means you won’t be able to push data back to the workflow, e.g. using cylc broadcast.
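
To check whether a job's state changes are arriving by polling, you can search the scheduler log for the (polled) marker. This is a minimal sketch, assuming the log lines look like the snippet above; `count_polled` is a hypothetical helper, and the log file path is passed in explicitly.

```shell
#!/bin/bash
# count_polled LOG_FILE TASK_NAME
# Prints how many log lines for TASK_NAME carry the "(polled)" marker.
count_polled() {
    local log_file="$1"
    local task="$2"
    # grep exits non-zero when nothing matches, so "|| true" keeps the
    # function usable under "set -e" and "set -o pipefail".
    grep -F "/${task}/" "$log_file" | grep -c -F "(polled)" || true
}
```

In Cylc 8 the scheduler log normally lives at ~/cylc-run/<workflow-id>/log/scheduler/log, so a call could look like `count_polled ~/cylc-run/glo12_cylc/run1/log/scheduler/log hindcast_recup_SST`.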

Push communication is the default; however, Cylc is using pull communication here, so either:

  • Cylc has been configured to use pull communication.
  • Or, there’s something wrong with the setup.

To find out which, run cylc config -i '[platforms]' and find the entry for the platform you are using to run this job. If it includes communication method = poll, then Cylc has been configured for pull communication.
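
The check can be narrowed down by filtering the output for the relevant setting. This is a small sketch, assuming the output contains lines of the form `communication method = <value>` inside a platform definition; `communication_methods` is a hypothetical helper that reads the configuration text on standard input.

```shell
#!/bin/bash
# communication_methods
# Reads configuration text on stdin and prints the value of every
# "communication method = ..." line, one value per line.
communication_methods() {
    sed -n 's/^[[:space:]]*communication method[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p'
}
```

A possible use is `cylc config -i '[platforms]' | communication_methods`. No output means the setting does not appear explicitly in the text that was printed.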

If the platform is not configured for pull communication, the troubleshooting guide in the Cylc documentation has some information on debugging.

Hi Oliver,

thanks a lot for your quick reply!

In our platform file we have

communication method = poll

And I can confirm that we configured the entire workflow for using the polling method, due to the special set-up (firewall problems) on the machine we are using.

So, if I understand correctly, all broadcasts should fail then? That makes a lot of sense now that you mention it.

But it also confuses me, since … they work. And this one worked until I introduced another broadcast before it (if I remove that one, it starts working again).

Also, it works from the command line: could it be that it just fails silently?

Stella

So, if I understand correctly, all broadcasts should fail then?

All broadcasts sent from the remote platform should fail.

Broadcasts made locally to the scheduler will still work.

E.g:

[scheduling]
  [[graph]]
    R1 = local & remote

[runtime]
  [[local]]
    # this should work:
    script = cylc broadcast "$CYLC_WORKFLOW_ID" -s '[environment]PASS=1'
  [[remote]]
    platform = my-remote-platform
    # this will probably fail:
    script = cylc broadcast "$CYLC_WORKFLOW_ID" -s '[environment]FAIL=2'

This is a common problem with HPC systems. As a result, we provide a fallback option for push communication that uses SSH rather than TCP:

communication method = ssh

With this configuration, Cylc will try to SSH from the HPC to the scheduler host. It will then use TCP to contact the scheduler.

This requires SSH to be set up from the HPC to the local platform, whereas normally we might only set it up in the other direction. It might be a bit more of a hassle to set up your user account to do this; however, it would enable push-based communication without coming into conflict with the firewall.

All Cylc commands can be invoked over this interface, including cylc broadcast.
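
Since this relies on SSH from the HPC back to the scheduler host, it can be worth confirming that a non-interactive login works before changing the platform configuration. This is a minimal sketch using standard OpenSSH options; `check_ssh_to_scheduler` is a hypothetical helper, and the host name you pass is whatever host your scheduler runs on.

```shell
#!/bin/bash
# check_ssh_to_scheduler HOST
# Tries a non-interactive SSH login to HOST and runs "true" there.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh_to_scheduler() {
    local host="$1"
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true; then
        echo "ssh to ${host}: ok"
    else
        echo "ssh to ${host}: failed"
        return 1
    fi
}
```

Run it from a login or compute node of the HPC, for example `check_ssh_to_scheduler my-scheduler-host`, where the host name is a placeholder.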

More info here:

Thanks Oliver for all this info!

I think that @gturek went through all of this before, and the polling was the only option left for us for some reason.

Tomorrow I will discuss it with her again, but I think we will have to come up with a different workflow that avoids broadcasts from the remote platform … anyway, it is very good to know why something fails :)

Thanks a lot again,
Stella

We should do something to make the error from commands like cylc broadcast more helpful in this situation.
