Dear Cylc support team,
We are working together with the TGCC support team to find a way to run Cylc on the Irene supercomputer, which has specific security policies that require us to use Cylc in a rather unusual way.
Context:
TGCC does not allow Cylc to run persistently in the background. As a result, we are forced to run Cylc within a batch job. Additionally, due to job time limits, we must resubmit the job every 24 hours, which means stopping the workflow and restarting it in a new job.
The main issue we’re facing is that when the job is resubmitted, it often runs on a different compute node, and the new instance of Cylc cannot connect to tasks still running on the previous node.
Example Script:
Here is a simplified example of the job script we’re using:
#!/bin/bash
#MSUB -T 180 #600
#MSUB -c 1
#MSUB -q rome
#MSUB -m scratch,work,store
#MSUB -o logs/test_reload_cylc%j.out
#MSUB -e logs/test_reload_cylc%j.err
###MSUB -A <projID>
###MSUB -Q <QOS>
# 1. Setup environment Modules
ml pu
module load cylc/8.3.4
STOP_ON_TIME=100 # time might need to be adapted!
WORKFLOW_DIR=$(dirname "$PWD")
WORKFLOW_NAME=$(basename "$PWD") # use the current folder as the workflow name
cd "$WORKFLOW_DIR/$WORKFLOW_NAME" ### check whether this should be here!!!!!
# 2. Prepare workflow on the first submission
if [[ $FIRST_TIME != 0 ]]; then
    cylc validate .
    cylc install
fi
# 3. Play workflow
cylc play "$WORKFLOW_NAME" &
# 4. Stop and Clean workflow before the end of the bridge job
date
echo "Cylc will be stopped ${STOP_ON_TIME}s before the end of the job."
sleep $((BRIDGE_MSUB_MAXTIME - STOP_ON_TIME)) && cylc stop --now "${WORKFLOW_NAME}"
date
# 5. Re-sub the bridge job
echo "Relaunch now:"
FIRST_TIME=0 ccc_msub test_reload_cylc.sh >> logs/test_reload_cylc.out 2>&1
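A note on the sleep line in step 4: if BRIDGE_MSUB_MAXTIME is unset or smaller than STOP_ON_TIME, the subtraction is negative, sleep rejects it, and the && chain then skips cylc stop, so the job resubmits immediately. Below is a minimal sketch of a guard for that case. The helper name compute_stop_delay is hypothetical (it is not part of Cylc or of the TGCC tools), and the fallback of 180 only mirrors the -T value in the script above.

```shell
#!/bin/bash
# compute_stop_delay MAXTIME MARGIN
# Prints the number of seconds to wait before stopping the workflow.
# Returns non-zero (and prints nothing on stdout) if the inputs are unusable.
compute_stop_delay() {
    local maxtime=$1 margin=$2
    # Both values must be plain non-negative integers.
    if ! [[ $maxtime =~ ^[0-9]+$ && $margin =~ ^[0-9]+$ ]]; then
        echo "compute_stop_delay: non-numeric input" >&2
        return 2
    fi
    # 10# forces base 10 so that a value such as 0180 is not read as octal.
    if (( 10#$margin >= 10#$maxtime )); then
        echo "compute_stop_delay: margin ${margin}s leaves no run time in ${maxtime}s" >&2
        return 1
    fi
    echo $(( 10#$maxtime - 10#$margin ))
}

# Example: fall back to 180 s when BRIDGE_MSUB_MAXTIME is not set (assumption).
if delay=$(compute_stop_delay "${BRIDGE_MSUB_MAXTIME:-180}" 100); then
    echo "would sleep ${delay}s before calling cylc stop"
fi
```

In the job script, step 4 could then read the delay from this helper and abort the resubmission if it fails, instead of relying on the sleep error.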
Observations:
This method works when the resubmitted job lands on the same node, but that cannot be relied on in a production environment.
We’ve tried a few mitigations, such as:
- Running cylc play in the background (&) to make the call non-blocking.
- Running 'cylc clean $WORKFLOW_NAME' before relaunching.
- Cleaning only the contact file with 'cylc clean ${WORKFLOW_NAME} -y --rm=.service/contact'.
- Removing .service/contact to avoid Cylc trying to reconnect to the old node.
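To make the contact-file mitigations easier to reason about, here is a small sketch that reads the host recorded in a contact file and compares it with the current node. It assumes the contact file is a list of KEY=VALUE lines and that the scheduler host is stored under CYLC_WORKFLOW_HOST; both should be checked against a real file on Irene. The function names are hypothetical, and the demo uses a temporary file rather than a real run directory.

```shell
#!/bin/bash
# contact_value FILE KEY
# Prints the value of the first "KEY=VALUE" line in FILE (nothing if absent).
contact_value() {
    local file=$1 key=$2
    sed -n "s/^${key}=//p" "$file" | head -n 1
}

# scheduler_host_changed FILE [CURRENT_HOST]
# Succeeds only if FILE records a host that differs from CURRENT_HOST.
# Note: the file may hold a fully qualified name while hostname prints a
# short one, so a real check may need to normalise both sides.
scheduler_host_changed() {
    local file=$1 current=${2:-$(hostname)}
    [[ -f $file ]] || return 1          # no contact file, nothing recorded
    local recorded
    recorded=$(contact_value "$file" CYLC_WORKFLOW_HOST)
    [[ -n $recorded && $recorded != "$current" ]]
}

# Demo on a stand-in file (contents are made up for illustration).
demo=$(mktemp)
printf 'CYLC_WORKFLOW_HOST=node-old\nCYLC_WORKFLOW_PORT=43001\n' > "$demo"
if scheduler_host_changed "$demo" node-new; then
    echo "previous scheduler ran on $(contact_value "$demo" CYLC_WORKFLOW_HOST)"
fi
rm -f "$demo"
```

Logging this at the start of each bridge job would at least show whether a failed restart coincides with a change of node.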
Workflow Configuration:
We are using Cylc 8.3.4 with a very basic workflow defined in flow.cylc:
[scheduler]
    allow implicit tasks = True
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[graph]]
        P1 = """
            A => B
            B[-P1] => A
        """
[runtime]
    [[A]]
        script = """
            date
            sleep 11
            date
        """
    [[B]]
        script = """
            date
            sleep 22
            date
        """
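As a rough worked example of why a task is normally still running when the stop is issued: with the values above (a 180 s job limit, a 100 s stop margin, and tasks of 11 s and 22 s that run strictly one after another), the arithmetic can be sketched as below. It ignores all scheduler start-up and job submission overhead, so the real numbers will be somewhat lower.

```shell
#!/bin/bash
JOB_LIMIT=180      # from "#MSUB -T 180"
STOP_MARGIN=100    # STOP_ON_TIME in the job script
A_SECS=11          # sleep in task A
B_SECS=22          # sleep in task B

run_window=$(( JOB_LIMIT - STOP_MARGIN ))    # seconds before cylc stop is called
cycle_secs=$(( A_SECS + B_SECS ))            # A => B, then B[-P1] => A: fully serial
full_cycles=$(( run_window / cycle_secs ))   # cycles that finish completely
offset=$(( run_window % cycle_secs ))        # seconds into the next cycle

# Which task is in progress at the moment the stop is issued?
if (( offset < A_SECS )); then running=A; else running=B; fi

echo "run window: ${run_window}s, full cycles: ${full_cycles}"
echo "task in progress at stop: ${running} of cycle $(( full_cycles + 1 ))"
```

Under these idealised assumptions the stop lands while B of the third cycle is running, which is the situation where the restarted scheduler has to find a task left over from the previous job.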
Request:
We would appreciate any guidance on how to make Cylc workflows more resilient in this constrained setup, in particular how to resume or reconnect workflows across different nodes, or whether there is a better approach for such environments.
Thank you in advance for your help!
Best regards,
Chloé Azria from IPSL - Plateforme Group