Kia ora! In our operational environment, Meteo France can switch where our pipelines are running, at any time, between two systems named belenos and taranis (named after Gallic gods). I need to come up with a new procedure for our Cylc workflows.
The two systems do not share any filesystem at all.
There are rsync jobs happening at midnight every evening between the two systems.
I have parametrized “platform” in flow.cylc, but this is obviously the “easy” bit. I suspect the cylc db is the hard part…
Thanx in advance.
Gaby
Hi,
Switching the platform is the easy bit. No need to parameterise the tasks; you can just change the platform on the fly:
# switch to belenos
$ cylc broadcast <workflow> -s 'platform = belenos'
# switch to taranis
$ cylc broadcast <workflow> -s 'platform = taranis'
The filesystem part is the tricky bit, and it's more bespoke to the workflow. These rsync tasks act like "checkpoints": when switching platform, you will likely want to roll back the workflow to the previous checkpoint.
E.g., say we have this workflow:
[scheduling]
    [[graph]]
        P1D = """
            a => b => c => checkpoint
        """
[runtime]
    [[PIPELINE]]
    [[a, b, c]]
        inherit = PIPELINE
    [[checkpoint]]
If the workflow is running task “2026-03-12/b” at the time of the switch, then you might want to do the following:
# switch platform to belenos
$ cylc broadcast <workflow> -s 'platform = belenos'
# re-run the pipeline
# (this will kill the running task and restart the pipeline from task "a")
$ cylc trigger <workflow>//20260312/PIPELINE
Gaby, @oliver.sanders' response seems to assume that:
- you want to leave the workflow (scheduler) running on the one platform but get it to submit its jobs to the other platform
- there are rsync "checkpoint" jobs in your workflows, to synchronize the files needed to continue at that point in the workflow.
But from my reading of your post, it sounds like you want to shift the scheduler itself to the other platform, and that the rsync jobs are some kind of general disk sync for all users, not part of the workflow - which makes it more difficult to switch the workflows over, because they will not know exactly when the disk state on the other end is compatible with the workflow state.
Can you clarify what your requirements are exactly?
Hi Hilary. The scheduler will stay put on the same host; it's just the run host that changes.
The rsync is currently a cron job and it’s done for the whole of the operational account, no one else.
If it would make more sense to do the rsync as a task, I should probably look into it. It would just need to rsync the cylc-run dirs, plus those dirs that are not part of the Cylc-managed directory tree, between run hosts.
However, see also my answer to Oliver: the swaps happen at random times, so I can't quite see how rsync as part of the workflow would help… Though it would happen more often (once per cycle).
Thanx Oliver. The broadcast trick is cool, but we use more than one platform definition, as we also use login nodes for trivial tasks such as copying, etc. I could define platform families and broadcast to those, I suppose.
As for the checkpoint, I am not sure I understand how it would work if the switches happen at totally random times, whenever Meteo France needs to do so for their own reasons.
I could define platform families and broadcast to those, I suppose.
That’s what we do.
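For reference, a minimal sketch of how platform families can look, broadcasting to each family with the documented `-n`/`--namespace` option. The family names (`HPC_JOBS`, `LOGIN_JOBS`), the task names, and the `*_login` platform names are illustrative assumptions, not from this thread:

```ini
# flow.cylc sketch -- family, task and *_login platform names are
# illustrative assumptions
[runtime]
    [[HPC_JOBS]]
        platform = belenos          # heavy compute tasks
    [[LOGIN_JOBS]]
        platform = belenos_login    # trivial copy/housekeeping tasks
    [[model]]
        inherit = HPC_JOBS
    [[copy_outputs]]
        inherit = LOGIN_JOBS
```

A single broadcast per family then retargets every member, e.g. `cylc broadcast <workflow> -n HPC_JOBS -s 'platform = taranis'`, and likewise for `LOGIN_JOBS`.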
As for the checkpoint, I am not sure I understand how it would work if the switches happen at totally random times, whenever Meteo France needs to do so for their own reasons.
In production, we handle similar situations via manual intervention.
Automating this is possible, but hard to advise on as the required intervention is likely specific to the particular workflow and circumstance. Some tools which may be useful depending on the situation:
- The `:submit-failed` task output occurs when Cylc is unable to submit a task to a platform, e.g. because it has been taken down. This event can be used to kick off tasks or scripts which could do things like remove, trigger or expire a pipeline of tasks.
- When you specify `'*'` as the cycle in Cylc commands, it expands to all active cycles. This can provide a convenient way to intervene with all active instances of tasks. E.g. this command would force the `PIPELINE` to re-run in all active cycles: `cylc trigger '<workflow>//*/PIPELINE'`.
- If the requirements are complex, the `cylc dump -t` command can be used to display all of the active tasks present in the workflow for use in automation scripts.
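As a hedged sketch of the first option: a task event handler can call out to a script of your own when a submission fails. Here `platform-switch-hook` is a hypothetical script, and attaching the handler to the `PIPELINE` family is an assumption about how you would wire it up:

```ini
# flow.cylc sketch -- "platform-switch-hook" is a hypothetical script
[runtime]
    [[PIPELINE]]
        [[[events]]]
            # runs when a job submission fails, e.g. because the
            # platform has been taken down
            submission failed handlers = platform-switch-hook %(workflow)s %(id)s
```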
Thanx Oliver! Just to recap, we can have two scenarios then:
- I rsync the latest changes
- I stop the workflow
- I broadcast change in platform
- I restart the workflow from the last task(s) it was running when I stopped it.
~ OR ~
- workflow fails because platform went away
- I broadcast change in platform
- I re-trigger the FAMILY that the failed task(s) belonged to
For the second case, the rsync won't have been able to happen, so we want to make sure we capture any missed logs, outputs, etc. since the last rsync.
Is this right?
I rsync the latest changes
I stop the workflow
I broadcast change in platform
I restart the workflow from the last task(s) it was running when I stopped it.
Could work, but you don't need to stop and restart the workflow. When you run "cylc broadcast", the changes are picked up by the workflow instantly; all new job submissions after that point will use the new config.
Question:
“Meteo France has the purview to swap where our pipelines are running”
- Does this mean they will ask you to switch platform?
- Or are they doing this at a higher level?
- When the switch happens, can existing pipeline runs continue on the old platform, or do you need to kill them on the old platform and re-run them on the new platform?
If it would make more sense to do the rsync as a task I should probably look into it
It would make this platform-change intervention easier if the rsync is done within the workflow.
E.g., something along these lines:
[scheduling]
    [[graph]]
        P1D = a => b => c => checkpoint
[runtime]
    [[root]]
        [[[environment]]]
            primary_platform = belenos
            backup_platform = taranis
    [[PRIMARY_PLATFORM]]
        platform = belenos
    [[PIPELINE]]
    [[a, b, c]]
        inherit = PRIMARY_PLATFORM, PIPELINE
    [[checkpoint]]
        inherit = PRIMARY_PLATFORM
        # this task runs on the primary platform, so the source path is
        # local (rsync cannot copy between two remote hosts in one call)
        script = rsync -a "$CYLC_TASK_SHARE_CYCLE_DIR/" "$backup_platform:$CYLC_TASK_SHARE_CYCLE_DIR/"
If the existing pipeline run can continue on the old platform, then the approach of starting every cycle with a "start" task can be especially beneficial.
This “start” task would determine what platform to run the cycle on, then broadcast that to all subsequent tasks in the cycle:
[scheduling]
    [[graph]]
        P1D = start => a => b => c => checkpoint
[runtime]
    [[start]]
        script = """
            primary_platform="$(what-is-the-primary-platform)"
            backup_platform="$(what-is-the-backup-platform)"
            cylc broadcast \
                "${CYLC_WORKFLOW_ID}" \
                -p "${CYLC_TASK_CYCLE_POINT}" \
                -s "[root][environment]primary_platform=$primary_platform" \
                -s "[root][environment]backup_platform=$backup_platform" \
                -s "[PRIMARY_PLATFORM]platform=$primary_platform"
        """
However, if the existing pipeline run must be aborted on the old platform (or is aborted for you), then you might want to write a script which broadcasts the platforms, then re-runs any active pipelines on the new platform.
Here’s a naive example of how that might look:
cylc broadcast "<workflow>" ...
for cycle in $(cylc dump "<workflow>" -t | grep --color=never '/[abc]' | grep --color=never -o '^[0-9T]*'); do
    cylc trigger "<workflow>//$cycle/PIPELINE"
done
Hi Oliver, replying to your questions:
- They do the switch themselves, and it can happen at any time. In particular, if it happens on the weekend when we are not around and whatever was running does not restart properly, it has to wait for our intervention on Monday.
- Nothing is left running on the old platform; it has to pick up on the new platform from where it left off, or (currently) restart anew if things are particularly messed up (our current pipelines all run via shell scripts).
So, we will need to provide them with a simple procedure on how to proceed.
In a “programmed” switch scenario then your example is what we want so we will need to come up with a script that does that.
In case of a catastrophic failure, my second scenario would be the use case.
Ok, so how do they achieve this?
- Running jobs are killed?
And how are you notified:
- Running jobs die and new jobs submit-fail?
- Via an event we could hook into?
- Via a service we can poll?
- Email?
I will need to find out the specifics from our oper guys, but from what I've seen so far, they (the Meteo France operators) stop everything and relaunch. We get an email saying there was a platform switch… then we have to check whether something broke (very fragile pipelines).
A Cylc run host is where the scheduler runs, not where the task jobs run. (There’s an associated global config item with the same name too.)
So I guess you mean that the scheduler run host does not change, but the job platform(s) does change.
This is why I originally thought you meant the scheduler run host changes. “Where our workflows (/pipelines) are running” would typically be taken to mean where the schedulers are running to manage your workflows, not where the workflows are submitting task jobs to run.
(Just to get us on the same terminological page, for the ongoing discussion!).
Hi there, so I checked with my oper guys. Except for catastrophic failures, we get an email that MF has switched the jobs platform. It is us who perform the switch for our workflows ASAP, meaning that if this happens on holidays, nights, weekends, etc. when there's no one on our side to take care of it, our pipelines keep going until someone can intervene (next morning, next Monday, etc.).
Ok, from that, it sounds like a solution along these lines would work best for you:
[scheduling]
    [[graph]]
        P1D = start => a => b => c => checkpoint
[runtime]
    [[start]]
        script = """
            primary_platform="$(what-is-the-primary-platform)"
            backup_platform="$(what-is-the-backup-platform)"
            cylc broadcast \
                "${CYLC_WORKFLOW_ID}" \
                -p "${CYLC_TASK_CYCLE_POINT}" \
                -s "[root][environment]primary_platform=$primary_platform" \
                -s "[root][environment]backup_platform=$backup_platform" \
                -s "[PRIMARY_PLATFORM]platform=$primary_platform"
        """
This is an automated solution: once started, a pipeline will continue on the same platform. When you are given notice of a platform swap, update the interface you use to tell the workflow which platform to submit to (e.g. this could be a text file).
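The `what-is-the-primary-platform` / `what-is-the-backup-platform` commands in the example above are placeholders; here is one hedged sketch of how they might be implemented over a text-file interface. The `PLATFORM_STATE_FILE` variable, the default file path, and the assumption that the platforms form a fixed belenos/taranis pair are all illustrative:

```shell
# Hedged sketch of the placeholder platform-query commands, backed by a
# one-line text file that you update when notified of a swap. The
# PLATFORM_STATE_FILE variable and the fixed belenos/taranis pair are
# illustrative assumptions, not part of the thread above.

what_is_the_primary_platform() {
    # read the current primary from the state file; default to belenos
    # if the file does not exist yet
    local state_file="${PLATFORM_STATE_FILE:-$HOME/.platform_state}"
    cat "$state_file" 2>/dev/null || echo "belenos"
}

what_is_the_backup_platform() {
    # the backup is whichever of the pair is not currently primary
    if [ "$(what_is_the_primary_platform)" = "belenos" ]; then
        echo "taranis"
    else
        echo "belenos"
    fi
}
```

Swapping platform then amounts to writing the new name into the state file; the next cycle's "start" task picks it up and broadcasts it.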