Cylc broadcast error: WorkflowStopped

Good afternoon,

I am seeking advice for debugging a broadcast.

In particular, I have a task that, after computing a variable, tries to send it to another task.

Let’s say this is my task (named hindcast_recup_SST):

#!/bin/bash
set -euo pipefail

# Count the files; this equals the number of tasks needed to create superobs.
# (ls -1 rather than ls -l, whose "total" line would inflate the count.)
n_task=$(ls -1 "${MY_PATH}" | wc -l)

# Send the task count to the superobs task for this cycle point.
cylc broadcast "${CYLC_WORKFLOW_ID}" -p "${CYLC_TASK_CYCLE_POINT}" \
                                     -n hindcast_create_superobs_SST \
                                     -s "[directives]--ntasks=${n_task}"
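As an aside: `ls -l` prepends a "total" line, so piping it to `wc -l` counts one line too many, while `ls -1` prints exactly one line per entry. A quick sketch in a throwaway directory (file names are illustrative):

```shell
#!/bin/bash
# Throwaway demo of the off-by-one: `ls -l` prepends a "total" line.
dir=$(mktemp -d)
touch "$dir/obs_a.nc" "$dir/obs_b.nc" "$dir/obs_c.nc"

with_l=$(ls -l "$dir" | wc -l)        # 4 lines: "total" + 3 files
one_per_line=$(ls -1 "$dir" | wc -l)  # 3 lines: one per file

echo "ls -l  count: $with_l"
echo "ls -1  count: $one_per_line"
rm -rf "$dir"
```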

hindcast_create_superobs_SST is the task where I need to set the ntasks directive.

Now the problem: the first task fails with the following error:

WorkflowStopped: glo12_cylc/run1 is not running
2025-05-13T12:48:24Z CRITICAL - failed/ERR

And the error is caused by the broadcast: if I remove it, the task succeeds.
I tried running the broadcast by hand on the command line, and it works:

$ cylc broadcast glo12_cylc/run1 -p 20241111 -n hindcast_create_superobs_SST -s "[directives]--ntasks=4"
Broadcast set:

  • [20241111/hindcast_create_superobs_SST] [directives]--ntasks=4

Now, I have run many tests, and I am quite at a loss.

First of all, how does Cylc check if a workflow is running? Normally I check my workflow with cylc tui, and there it is marked as running, but from the error message I understand that something must be telling Cylc the opposite.

Second: how do I debug this? I tried running the workflow in debug mode, but I just see errors like:

DEBUG - [20241111/hindcast_recup_SST/01:submitted] (polled)failed
INFO - [20241111/hindcast_recup_SST/01:submitted] setting implied output: started
DEBUG - [20241111/hindcast_recup_SST/01:submitted] (internal)started
INFO - [20241111/hindcast_recup_SST/01:submitted] => running
DEBUG - [20241111/hindcast_recup_SST/01:running] health: execution timeout=None, polling intervals=PT1M,3*PT2M,PT1M,…
INFO - [20241111/hindcast_recup_SST/01:running] => failed
WARNING - [20241111/hindcast_recup_SST/01:failed] did not complete the required outputs:
⨯ ┆ succeeded

I don’t know how to retrieve more information on the failure; I added --debug to the broadcast command without much success. Also, I have previous broadcasts in the workflow, and they work. Could it be that some previous broadcast fails, and this one is just trapping the error?

Any help will be highly appreciated!

Best,
Stella

First of all, how does Cylc check if a workflow is running?

When a Cylc workflow starts, it creates a special "contact" file containing information about where the scheduler is running. You can find this file at ~/cylc-run/<workflow-id>/.service/contact. The file is removed when the scheduler shuts down.

All Cylc utilities, from cylc broadcast to cylc gui, use this file to determine whether the workflow is running.
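A quick way to check by hand (the workflow ID here is the one from this thread; substitute your own):

```shell
#!/bin/bash
# Check for the scheduler's contact file. If it exists, Cylc commands
# will treat the workflow as running; if not, they fail with
# WorkflowStopped. The workflow ID below is from this thread.
WORKFLOW_ID="glo12_cylc/run1"
CONTACT="$HOME/cylc-run/$WORKFLOW_ID/.service/contact"

if [ -f "$CONTACT" ]; then
    echo "contact file found - scheduler details:"
    cat "$CONTACT"
else
    echo "no contact file - the workflow is considered stopped"
fi
```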


Second: how do I debug this?

Cylc needs to keep track of the status of the jobs it submits. There are two options for this:

  • Push - Cylc jobs send messages back to the scheduler.
  • Pull - The scheduler routinely checks the job.status file that each job creates.

Looking at this log snippet:

DEBUG - [20241111/hindcast_recup_SST/01:submitted] (polled)failed

The (polled) bit means that Cylc is using “pull” based communication (aka “polling”) for this job.

Unfortunately, that means you won’t be able to push data back to the workflow, e.g. using cylc broadcast.

Push communication is the default; however, Cylc will fall back to pull communication in some circumstances.

So either:

  • Cylc has been configured to use pull communication.
  • Or, there’s something wrong with the setup.

To find out which, run cylc config -i '[platforms]' and find the entry for the platform you are using to run this job. If it includes communication method = poll, then Cylc has been configured for pull communication.
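For example, on a site configured for polling, the relevant part of that output might look like this (the platform name is hypothetical):

```
[platforms]
    [[my-remote-platform]]
        communication method = poll
```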

If the platform is not configured for pull communication, this section of the troubleshooting guide has some information on debugging:

Hi Oliver,

thanks a lot for your quick reply!

In our platform file we have

communication method = poll

And I can confirm that we configured the entire workflow for using the polling method, due to the special set-up (firewall problems) on the machine we are using.

So, if I understand correctly, all broadcasts should fail then? It makes a lot of sense now that you mention it.

But it also confuses me, since … they work. And this one worked until I introduced another broadcast before it (if I remove that one, it starts working again).

Also, it works from the command line: could it be that it just fails silently?

Stella

So, if I understand correctly, all broadcasts should fail then?

All broadcasts sent from the remote platform should fail.

Broadcasts made locally to the scheduler will still work.

E.g.:

[scheduling]
  [[graph]]
    R1 = local & remote

[runtime]
  [[local]]
    # this should work:
    script = cylc broadcast "$CYLC_WORKFLOW_ID" -s '[environment]PASS=1'
  [[remote]]
    platform = my-remote-platform
    # this will probably fail:
    script = cylc broadcast "$CYLC_WORKFLOW_ID" -s '[environment]FAIL=2'

This is a common problem on HPC systems; as a result, we provide a fallback option for push communication that uses SSH rather than TCP:

communication method = ssh

With this configuration, Cylc will SSH from the HPC to the scheduler host, and then use TCP locally to contact the scheduler.

This requires SSH to be set up from the HPC to the local platform, whereas normally we might only set it up in the other direction. It may be a bit of a hassle to configure your user account for this; however, it would enable push-based communication without coming into conflict with the firewall.

All Cylc commands can be invoked over this interface, including cylc broadcast.
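For reference, a minimal global.cylc platform entry using this method might look like the following (the platform name and host are hypothetical):

```
# global.cylc (sketch; platform name and host are illustrative)
[platforms]
    [[my-remote-platform]]
        hosts = hpc-login-01
        communication method = ssh
```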

More info here:


Thanks Oliver for all this info!

I think that @gturek went through all of this before, and the polling was the only option left for us for some reason.

Tomorrow I will discuss it with her again, but I think we will have to come up with a different workflow that avoids broadcasts from the remote platform … it is in any case very good to know why something fails :slight_smile:

Thanks a lot again,
Stella


We should do something to make the error from commands like cylc broadcast more helpful in this situation.


For BSC MN5, at least, only the poll communication method works; TCP and SSH both time out or fail. So if you have a similar HPC environment, where the compute nodes have no connectivity back to the login nodes, then you might need to keep using the poll method.

It’s perfectly possible to configure the network to allow push TCP comms back from HPC compute nodes, of course, but whether or not you can persuade the platform admins to do it might be another matter, unfortunately.

You might try to enlist the help of someone with power, to make the business case - efficient workflow orchestration makes for better use of the machine, and presumably your workflow products are important too.

If you are stuck with job polling (pull), you should still be able to achieve the same result by adding a local task to your workflow.

Instead of:

  • remote task a broadcasts information to downstream tasks

Do this:

  • remote task a writes information to disk
  • local task b triggers off of a, reads the information from disk and broadcasts it to downstream tasks
    • (this might involve ssh to the remote platform, if the filesystem is different, but that’s OK - ssh must work in that direction because it’s used for job submission)
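A minimal sketch of this pattern (task names, paths, and the platform name are all illustrative; it assumes the remote platform shares a filesystem with the scheduler host, otherwise the relay task would ssh/scp to fetch the file, as noted above):

```
# flow.cylc (sketch; names and paths are illustrative)
[scheduling]
    [[graph]]
        R1 = "count_files => relay_count => create_superobs"

[runtime]
    [[count_files]]
        platform = my-remote-platform
        # Write the result to disk instead of broadcasting it.
        script = """
            n_task=$(ls -1 "${MY_PATH}" | wc -l)
            echo "${n_task}" > "${CYLC_WORKFLOW_SHARE_DIR}/n_task"
        """
    [[relay_count]]
        # Runs locally, so it can reach the scheduler.
        script = """
            n_task=$(cat "${CYLC_WORKFLOW_SHARE_DIR}/n_task")
            cylc broadcast "${CYLC_WORKFLOW_ID}" \
                -p "${CYLC_TASK_CYCLE_POINT}" \
                -n create_superobs \
                -s "[directives]--ntasks=${n_task}"
        """
    [[create_superobs]]
        platform = my-remote-platform
```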

Hi Bruno (nice to hear from you :smiling_face:!)

We have a slightly more complicated set-up than MN5 (but fundamentally the same issue): we have our own dedicated login node on the machine we use (MeteoFrance), where we set up the Cylc instance. And this login node is not visible from the compute nodes (you can’t connect back). So you’re right: the only method we can use is polling; all the rest fail.

@hilary.j.oliver Yes, I thought about this intermediate set-up.
But for the moment I decided to slightly change the workflow to avoid those broadcasts; I thought it was simpler and more linear (it just creates a bit of duplication between the tasks). If at some point we can’t avoid broadcasting, I will implement this “workaround”, though. Thanks!

Best,
Stella

Cylc’s SSH communication method combined with SSH tunnelling (via the login node) might be an option in some difficult setups, but if the compute nodes can’t even see the login nodes, and TCP ports cannot be opened, then I think you’re stuck with polling :frowning: .

Unfortunately, it is hard to send data back to the scheduler in this case, as there is no way of communicating with it.

However, there is one method you can use to send simple pre-defined outcomes back to the scheduler: cylc message. These messages are written into the job.status file, so they are polled by the scheduler along with the job status. You can’t send arbitrary data back via this mechanism, but you can send pre-defined messages and map them onto triggers that can then be used in the workflow’s graph.

Tutorial: Message Triggers — Cylc 8.4.2 documentation
Example (with graph branching): Scheduling Configuration — Cylc 8.4.2 documentation
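To illustrate, here is a small sketch of a message trigger with graph branching (task names, output names, and $OBS_DIR are all illustrative):

```
# flow.cylc (sketch)
[scheduling]
    [[graph]]
        R1 = """
            checker:found_obs? => process_obs
            checker:no_obs? => skip_cycle
        """

[runtime]
    [[checker]]
        script = """
            if [ -n "$(ls -A "${OBS_DIR}")" ]; then
                cylc message -- "found observations"
            else
                cylc message -- "no observations"
            fi
        """
        [[[outputs]]]
            # Map each custom output name to its message text.
            found_obs = "found observations"
            no_obs = "no observations"
    [[process_obs]]
    [[skip_cycle]]
```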
