Kia ora! In our operational environment, Meteo France can switch where our pipelines are running, at any time, between two systems named belenos and taranis (named after Gallic gods). I need to come up with a new procedure for our Cylc workflows.
The two systems do not share any filesystem at all.
There are rsync jobs happening at midnight every evening between the two systems.
I have parametrized “platform” in flow.cylc, but this is obviously the “easy” bit. I suspect the cylc db is the hard part…
Thanx in advance.
Gaby
Hi,
Switching the platform is the easy bit. No need to parameterise the tasks; you can just change the platform on the fly:
# switch to belenos
$ cylc broadcast <workflow> -s 'platform = belenos'
# switch to taranis
$ cylc broadcast <workflow> -s 'platform = taranis'
The filesystem part is the tricky bit, and it's more bespoke to the workflow. These rsync tasks act like "checkpoints": when switching platform, you will likely want to roll back the workflow to the previous checkpoint.
E.g., say we have this workflow:
[scheduling]
    [[graph]]
        P1D = """
            a => b => c => checkpoint
        """
[runtime]
    [[PIPELINE]]
    [[a, b, c]]
        inherit = PIPELINE
    [[checkpoint]]
If the workflow is running task “2026-03-12/b” at the time of the switch, then you might want to do the following:
# switch platform to belenos
$ cylc broadcast <workflow> -s 'platform = belenos'
# re-run the pipeline
# (this will kill the running task and restart the pipeline from task "a")
$ cylc trigger <workflow>//20260312/PIPELINE
Gaby, @oliver.sanders' response seems to assume that:
- you want to leave the workflow (scheduler) running on the one platform but get it to submit its jobs to the other platform
- there are rsync "checkpoint" jobs in your workflows, to synchronize the files needed to continue at that point in the workflow.
But from my reading of your post, it sounds like you want to shift the scheduler itself to the other platform, and that the rsync jobs are some kind of general disk sync for all users, not part of the workflow - which makes it more difficult to switch the workflows over, because they will not know exactly when the disk state on the other end is compatible with the workflow state.
Can you clarify what your requirements are exactly?
Hi Hilary. The scheduler will stay put on the same host; it's just the run host that changes.
The rsync is currently a cron job and it’s done for the whole of the operational account, no one else.
If it would make more sense to do the rsync as a task, I should probably look into it. It would just need to rsync the cylc-run dirs, plus those dirs that are not part of the Cylc-managed directory tree, between run hosts.
However, see also my answer to Oliver: the swaps happen at random times, so I can't quite see how rsync as part of the workflow would help… Though it would happen more often (once per cycle).
Thanx Oliver. The broadcast trick is cool, but we use more than one platform definition, as we also use login nodes for trivial tasks such as copying, etc. I could define platform families and broadcast to those, I suppose.
As for the checkpoint, I am not sure I understand how it would work if the switches happen at totally random times, whenever Meteo France needs to do so for their own reasons.
I could define platform families and broadcast to those, I suppose.
That’s what we do.
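For reference, a minimal sketch of how platform families can look, broadcasting to each family with the documented `-n`/`--namespace` option. The family names (`HPC_JOBS`, `LOGIN_JOBS`), the task names, and the `*_login` platform names are illustrative assumptions, not from this thread:

```ini
# flow.cylc sketch -- family, task and *_login platform names are
# illustrative assumptions
[runtime]
    [[HPC_JOBS]]
        platform = belenos          # heavy compute tasks
    [[LOGIN_JOBS]]
        platform = belenos_login    # trivial copy/housekeeping tasks
    [[model]]
        inherit = HPC_JOBS
    [[copy_outputs]]
        inherit = LOGIN_JOBS
```

A single broadcast per family then retargets every member, e.g. `cylc broadcast <workflow> -n HPC_JOBS -s 'platform = taranis'`, and likewise for `LOGIN_JOBS`.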
As for the checkpoint, I am not sure I understand how it would work if the switches happen at totally random times, whenever Meteo France needs to do so for their own reasons.
In production, we handle similar situations via manual intervention.
Automating this is possible, but hard to advise on as the required intervention is likely specific to the particular workflow and circumstance. Some tools which may be useful depending on the situation:
- The `:submit-failed` task output occurs when Cylc is unable to submit a task to a platform, e.g. because it has been taken down. This event can be used to kick off tasks or scripts which could do things like remove, trigger or expire a pipeline of tasks.
- When you specify `'*'` as the cycle in Cylc commands, it expands to all active cycles. This can provide a convenient way to intervene with all active instances of tasks. E.g. this command would force the `PIPELINE` to re-run in all active cycles: `cylc trigger '<workflow>//*/PIPELINE'`.
- If the requirements are complex, the `cylc dump -t` command can be used to display all of the active tasks present in the workflow for use in automation scripts.
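As a hedged sketch of the first option: a task event handler can call out to a script of your own when a submission fails. Here `platform-switch-hook` is a hypothetical script, and attaching the handler to the `PIPELINE` family is an assumption about how you would wire it up:

```ini
# flow.cylc sketch -- "platform-switch-hook" is a hypothetical script
[runtime]
    [[PIPELINE]]
        [[[events]]]
            # runs when a job submission fails, e.g. because the
            # platform has been taken down
            submission failed handlers = platform-switch-hook %(workflow)s %(id)s
```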
Thanx Oliver! Just to recap, we can have two scenarios then:
- I rsync the latest changes
- I stop the workflow
- I broadcast change in platform
- I restart the workflow from the last task(s) it was running when I stopped it.
~ OR ~
- workflow fails because platform went away
- I broadcast change in platform
- I re-trigger the FAMILY that the failed task(s) belonged to
For the second case, the rsync won't have been able to happen, so we want to make sure we capture any missed logs, outputs, etc. since the last rsync.
Is this right?
I rsync the latest changes
I stop the workflow
I broadcast change in platform
I restart the workflow from the last task(s) it was running when I stopped it.
Could work, but you don't need to stop and restart the workflow. When you run "cylc broadcast", the changes are picked up by the workflow instantly; all new job submissions after that point will use the new config.
Question:
“Meteo France has the purview to swap where our pipelines are running”
- Does this mean they will ask you to switch platform?
- Or are they doing this at a higher level?
- When the switch happens, can existing pipeline runs continue on the old platform, or do you need to kill them on the old platform and re-run them on the new platform?
If it would make more sense to do the rsync as a task I should probably look into it
It would make this platform-change intervention easier if the rsync is done within the workflow.
E.g., something along these lines:
[scheduling]
    [[graph]]
        P1D = a => b => c => checkpoint
[runtime]
    [[root]]
        [[[environment]]]
            primary_platform = belenos
            backup_platform = taranis
    [[PRIMARY_PLATFORM]]
        platform = belenos
    [[PIPELINE]]
    [[a, b, c]]
        inherit = PRIMARY_PLATFORM, PIPELINE
    [[checkpoint]]
        inherit = PRIMARY_PLATFORM
        # this task runs on the primary platform, so the source path is
        # local (rsync cannot copy between two remote hosts in one call)
        script = rsync -a "$CYLC_TASK_SHARE_CYCLE_DIR/" "$backup_platform:$CYLC_TASK_SHARE_CYCLE_DIR/"
If the existing pipeline run can continue on the old platform, then the approach of starting every cycle with a "start" task can be especially beneficial.
This “start” task would determine what platform to run the cycle on, then broadcast that to all subsequent tasks in the cycle:
[scheduling]
    [[graph]]
        P1D = start => a => b => c => checkpoint
[runtime]
    [[start]]
        script = """
            primary_platform="$(what-is-the-primary-platform)"
            backup_platform="$(what-is-the-backup-platform)"
            cylc broadcast \
                "${CYLC_WORKFLOW_ID}" \
                -p "${CYLC_TASK_CYCLE_POINT}" \
                -s "[root][environment]primary_platform=$primary_platform" \
                -s "[root][environment]backup_platform=$backup_platform" \
                -s "[PRIMARY_PLATFORM]platform=$primary_platform"
        """
However, if the existing pipeline run must be aborted on the old platform (or is aborted for you), then you might want to write a script which broadcasts the platforms, then re-runs any active pipelines on the new platform.
Here’s a naive example of how that might look:
cylc broadcast "<workflow>" ...
for cycle in $(cylc dump "<workflow>" -t | grep --color=never '/[abc]' | grep --color=never -o '^[0-9T]*'); do
    cylc trigger "<workflow>//$cycle/PIPELINE"
done
Hi Oliver, replying to your questions:
- They do the switch themselves, and it can happen at any time. In particular, if it happens on the weekend when we are not around and whatever was running does not restart properly, it has to wait for our intervention on Monday.
- Nothing is left running on the old platform; it has to pick up on the new platform from where it left off, or (currently) restart anew if things are particularly messed up (our current pipelines all run via shell scripts).
So, we will need to provide them with a simple procedure on how to proceed.
In a “programmed” switch scenario then your example is what we want so we will need to come up with a script that does that.
In case of a catastrophic failure, my second scenario would be the use case.
Ok, so how do they achieve this?
- Running jobs are killed?
And how are you notified:
- Running jobs die and new jobs submit-fail?
- Via an event we could hook into?
- Via a service we can poll?
- Email?
I will need to find out the specifics from our oper guys, but from what I've seen so far, they (the Meteo France operators) stop everything and relaunch. We get an email saying there was a platform switch… then we have to check whether something broke (very fragile pipelines).
A Cylc run host is where the scheduler runs, not where the task jobs run. (There’s an associated global config item with the same name too.)
So I guess you mean that the scheduler run host does not change, but the job platform(s) does change.
This is why I originally thought you meant the scheduler run host changes. “Where our workflows (/pipelines) are running” would typically be taken to mean where the schedulers are running to manage your workflows, not where the workflows are submitting task jobs to run.
(Just to get us on the same terminological page, for the ongoing discussion!).
Hi there, so I checked with my oper guys. Except for catastrophic failures, we get an email that MF has switched the jobs platform. It is us who perform the switch for our workflows ASAP, meaning that if this happens on holidays, nights, weekends, etc. when there's no one on our side to take care of it, our pipelines keep going until someone can intervene (next morning, next Monday, etc.).
Ok, from that, it sounds like a solution along these lines would work best for you:
[scheduling]
    [[graph]]
        P1D = start => a => b => c => checkpoint
[runtime]
    [[start]]
        script = """
            primary_platform="$(what-is-the-primary-platform)"
            backup_platform="$(what-is-the-backup-platform)"
            cylc broadcast \
                "${CYLC_WORKFLOW_ID}" \
                -p "${CYLC_TASK_CYCLE_POINT}" \
                -s "[root][environment]primary_platform=$primary_platform" \
                -s "[root][environment]backup_platform=$backup_platform" \
                -s "[PRIMARY_PLATFORM]platform=$primary_platform"
        """
This is an automated solution: once started, a pipeline will continue on the same platform. When you are given notice of a platform swap, update the interface you use to tell the workflow which platform to submit to (e.g. this could be a text file).
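The `what-is-the-primary-platform` / `what-is-the-backup-platform` commands in the example above are placeholders; here is one hedged sketch of how they might be implemented over a text-file interface. The `PLATFORM_STATE_FILE` variable, the default file path, and the assumption that the platforms form a fixed belenos/taranis pair are all illustrative:

```shell
# Hedged sketch of the placeholder platform-query commands, backed by a
# one-line text file that you update when notified of a swap. The
# PLATFORM_STATE_FILE variable and the fixed belenos/taranis pair are
# illustrative assumptions, not part of the thread above.

what_is_the_primary_platform() {
    # read the current primary from the state file; default to belenos
    # if the file does not exist yet
    local state_file="${PLATFORM_STATE_FILE:-$HOME/.platform_state}"
    cat "$state_file" 2>/dev/null || echo "belenos"
}

what_is_the_backup_platform() {
    # the backup is whichever of the pair is not currently primary
    if [ "$(what_is_the_primary_platform)" = "belenos" ]; then
        echo "taranis"
    else
        echo "belenos"
    fi
}
```

Swapping platform then amounts to writing the new name into the state file; the next cycle's "start" task picks it up and broadcasts it.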