Running Cylc on TGCC’s Irene – Handling Node Changes Between Job Resubmissions

Dear Cylc support team,

We are working together with the TGCC support team to find a way to run Cylc on the Irene supercomputer, which has specific security policies that require us to use Cylc in a rather unusual way.

Context:

TGCC does not allow Cylc to run persistently in the background. As a result, we are forced to run Cylc within a batch job. Additionally, due to job time limits, we must resubmit the job every 24 hours, which means stopping the workflow and restarting it in a new job.

The main issue we’re facing is that when the job is resubmitted, it often runs on a different compute node, and the new instance of Cylc cannot connect to tasks still running on the previous node.

Example Script:

Here is a simplified example of the job script we’re using:

#!/bin/bash
#MSUB -T 180 #600
#MSUB -c 1
#MSUB -q rome
#MSUB -m scratch,work,store
#MSUB -o logs/test_reload_cylc%j.out
#MSUB -e logs/test_reload_cylc%j.err
###MSUB -A <projID>
###MSUB -Q <QOS>

# 1. Set up the environment modules
ml pu
module load cylc/8.3.4

STOP_ON_TIME=100  # stop Cylc this many seconds before the job time limit (may need tuning)
WORKFLOW_DIR=$(dirname "$PWD")
WORKFLOW_NAME=$(basename "$PWD")  # use the current folder name as the workflow name
cd "$WORKFLOW_DIR/$WORKFLOW_NAME" ### check whether this should be here!!!!!


# 2. Validate and install the workflow on the first submission only
#    (FIRST_TIME is set to 0 when the job resubmits itself in step 5)
if [[ "$FIRST_TIME" != 0 ]]; then
  cylc validate .
  cylc install
fi

# 3. Play workflow
cylc play $WORKFLOW_NAME &

# 4. Stop the workflow shortly before the end of the bridge job
date
echo "Cylc will be stopped ${STOP_ON_TIME}s before the end of the job."
sleep $(($BRIDGE_MSUB_MAXTIME - $STOP_ON_TIME)) && cylc stop --now ${WORKFLOW_NAME}

date

# 5. Resubmit the bridge job
echo "Relaunch now:"
FIRST_TIME=0 ccc_msub test_reload_cylc.sh >> logs/test_reload_cylc.out 2>&1

Observations:

This method works if the job happens to be resubmitted on the same node, but that cannot be guaranteed in a production environment.

We’ve tried a few mitigations, such as:

  • Running cylc play in the background (&) to make the call non-blocking.

  • Running 'cylc clean $WORKFLOW_NAME' before relaunching.

  • Cleaning and removing the contact file with 'cylc clean ${WORKFLOW_NAME} -y --rm=.service/contact'.

  • Removing .service/contact manually, so that Cylc does not try to reconnect to the old node.

Workflow Configuration:

We are using Cylc 8.3.4 with a very basic workflow defined in flow.cylc:

[scheduler]
    allow implicit tasks = True
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[graph]]
        P1 = """ 
            A => B
            B[-P1] => A 
            """
[runtime]
    [[A]]
        script = """
                    date
                    sleep 11
                    date
                """

    [[B]]
        script = """
                    date
                    sleep 22
                    date
                """

Request:

We would appreciate any guidance on how to make Cylc workflows more resilient in this constrained setup — particularly in how to resume or reconnect workflows across different nodes, or if there’s a better approach for such environments.

Thank you in advance for your help!

Best regards,
Chloé Azria from IPSL - Plateforme Group

Hi :wave:,


specific security policies that require us to use Cylc in a rather unusual way.

That is an unusual way of deploying Cylc!

Usually sites either:

  1. Run the workflows on a server outside of the HPC.
    • This server could be anywhere; it doesn't have to be within the HPC.
    • The only requirement is that you are able to SSH from this server to the HPC login nodes to submit jobs.
  2. Run workflows on the login node (not the compute node).
    • Sites generally permit long-running processes on the login nodes, but not on the compute nodes.

If either of these options is possible, it will be greatly preferable, as the restart batch-job chain is going to be awkward and fragile. We have experience of a setup where SSH is possible but requires authentication via an external tool, and we have been able to make that work. Happy to advise as best we're able.


Here is a simplified example of the job script we’re using:

In this example workflow, the two tasks A and B are submitted as local background jobs. This means that they will run within the same batch job that the Cylc scheduler is itself running in.

When you stop the workflow with cylc stop --now, you stop the Cylc scheduler but leave the jobs running. These Cylc jobs will presumably get killed when the batch system job exits?

When you restart the workflow, Cylc will try to check the status of these jobs. Because these are background jobs (not submitted to a batch system), Cylc will have to SSH back to the node they were started on to do this, which is where your problems are coming from.

This approach couples the lifecycle of the jobs Cylc submits to that of the Cylc scheduler itself (i.e. Cylc jobs cannot outlive the scheduler).

To work around this issue, either:

  1. Wait for the local jobs to finish before stopping the scheduler:
    • If you change the script to run cylc stop (without the --now), then the Cylc scheduler will wait for these jobs to finish before exiting.
    • This removes the need to access the previous compute node on restart (see the sketch after this list).
  2. Submit the Cylc jobs to the batch system:
    • Normally, Cylc workflows submit their jobs to the batch system itself.
    • If you are able to submit jobs from the compute node, this would be a much better way to go as the lifecycle of the jobs Cylc submits would be uncoupled from the lifecycle of the Cylc scheduler.
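
As a rough sketch of option 1 (untested, reusing the STOP_ON_TIME and WORKFLOW_NAME variables from the job script above), the shutdown step could become something like this; STOP_ON_TIME would then need to be at least as long as your longest task job:

# 4. Stop the workflow before the end of the bridge job.
# Without --now, the scheduler lets the task jobs that are already running
# finish before it shuts down, so nothing is left behind on the old node.
sleep $(( BRIDGE_MSUB_MAXTIME - STOP_ON_TIME ))
cylc stop "$WORKFLOW_NAME"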

Running 'cylc clean $WORKFLOW_NAME' before relaunching

The cylc clean command will delete the workflow installation and all the files it has created.

This shouldn’t be necessary.


Cleaning and removing the contact file with 'cylc clean ${WORKFLOW_NAME} -y --rm=.service/contact'

Removing .service/contact manually, so that Cylc does not try to reconnect to the old node.

The Cylc scheduler creates the .service/contact file when it starts up and deletes this file when it shuts down.

If a contact file is present, it either means that the Cylc scheduler is running or that something has killed the Cylc scheduler in a way that prevented it from removing this file (e.g. kill -9 or pulling the power cord out of the compute node).

This shouldn’t be necessary.


Hope this helps, Oliver

A small observation - you don’t need to background cylc play with & because (without the --no-detach option) it automatically runs as a daemon - i.e. not just in the background, but detached from the parent process.


I have a little experience running Cylc in a similar environment - inside time-limited Slurm batch jobs, on compute nodes. This mostly just restates Oliver’s advice, but in case it helps…

The contact file will not be left behind if the Cylc scheduler shuts down cleanly. But if your shutdown time estimate is off and the batch system kills it, the contact file will be left behind, and that will be a problem for restarting the workflow: at restart Cylc will try to ssh to the original compute node (as recorded in the contact file) to see if the original process is still running there. If user ssh to compute nodes is disallowed, that will fail, and you will have to manually delete the contact file.
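
If you do end up in that situation, something like the following (run before the restart, and only once you are sure the old scheduler is gone) is a rough sketch of the manual clean-up. WORKFLOW_NAME is the variable from the job script earlier in the thread, and the path assumes the default cylc-run layout with numbered runs, where the runN symlink points at the latest run:

# Remove a contact file left behind by an unclean shutdown, so that the
# restart does not try to ssh back to the old compute node.
CONTACT="$HOME/cylc-run/$WORKFLOW_NAME/runN/.service/contact"
[[ -f "$CONTACT" ]] && rm -f "$CONTACT"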

Your test workflow uses the default “background” job runner, which runs jobs directly on the scheduler host. These jobs get recorded, by Cylc, as running on “localhost”, so there’s a built-in assumption that the scheduler is not going to move to another host while background jobs are running. If you restart the scheduler on another host, it is going to look for those jobs on the new host and conclude that they must have failed.

Can you ssh from compute nodes to a login node, to submit other batch jobs from inside a running batch job? If so, you should define an appropriate Cylc platform (global config) and have your workflow submit jobs to the batch system instead of background. Then on restart, Cylc can query the batch system to find orphaned jobs wherever they are located (note for orphaned jobs that are no longer listed in the batch system, this also involves ssh to the platform hosts to check the Cylc-generated job status file - but not to the compute nodes).
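
For illustration only, a platform definition along these lines might do the job, assuming the login node can be reached by ssh from the compute nodes, shares a filesystem with them, and accepts plain Slurm commands (the platform and host names are placeholders, and TGCC's ccc_* wrappers may need extra handling):

# ~/.cylc/flow/global.cylc
[platforms]
    [[irene_slurm]]                  # placeholder platform name
        hosts = irene-login1         # placeholder login node, reachable by ssh
        job runner = slurm           # submit task jobs with sbatch instead of background
        install target = localhost   # assumes the filesystem is shared with the scheduler host

Each task in flow.cylc would then select it with platform = irene_slurm in its [runtime] section.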

All things considered, it would be better to run Cylc schedulers on an HPC login node (if that causes load issues, then your admins might just have to provide more resources :slight_smile:) and avoid all of these problems. Or even better, as Oliver notes, an off-site VM that has access to the HPC.

If you are truly stuck with your current setup and background jobs, you might have to shut down Cylc prior to the end of your batch allocation with cylc stop --kill to kill any running jobs. As noted above, Cylc won't be able to find orphaned background jobs if you restart on another host, so rerunning those jobs from scratch is likely your only option, short of "manually" waiting for them to finish before you restart and then (after restart) setting them to succeeded with cylc set to allow the workflow to continue. [Note the next comment below: waiting for the jobs to finish only applies if you shut down before the batch allocation ends, with time left for them to finish.]
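
In command form, the two last-resort routes look roughly like this (WORKFLOW_NAME as in the job script above; the cycle point and task name are just placeholders from the example workflow):

# Option A: before the batch allocation ends, stop the scheduler and kill its
# running background task jobs (they will be rerun after the restart):
cylc stop --kill "$WORKFLOW_NAME"

# Option B: if instead you waited for a task job to finish before shutting
# down, mark it as succeeded after "cylc play" in the next batch job so the
# graph can carry on:
cylc set --out=succeeded "$WORKFLOW_NAME//3/A"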


(Background jobs will be killed anyway when the batch allocation ends, but the difference is Cylc will know they’re dead at restart, so it won’t have to try to determine whether they’re still running or not).


Thank you very much for your answers.
We have made some tests to find a solution.
We have found that the running task is not killed when the job is killed, which is what we want in our system.
It appears that the problem is that cylc stop --now is non-blocking, so the stop may not have finished yet when the job is killed.
The method works if we check whether the file ".service/contact" still exists and only finish the job once it has been removed.
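
A minimal sketch of that check (assuming the variables from the job script at the top of the thread, and the default cylc-run layout with the runN symlink pointing at the latest run):

# Ask the scheduler to shut down without waiting for running task jobs...
cylc stop --now "$WORKFLOW_NAME"

# ...then hold the bridge job open until the scheduler has really gone:
# the scheduler removes .service/contact once it has shut down cleanly.
CONTACT="$HOME/cylc-run/$WORKFLOW_NAME/runN/.service/contact"
while [[ -f "$CONTACT" ]]; do
    sleep 5
done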

I think we will have some other issues due to this method of running Cylc, but at least we can run the workflow now.

Chloé

I presume by “the job” here you mean the scheduler?

The terminology gets confusing when we run schedulers inside batch jobs, because in Cylc a “job” is a process submitted to run by a task (via a task job script).

Either way I don’t think I understand your statement. If you’re running local background task jobs inside a Slurm job allocation, and that allocation ends, is it not the case that the background task job gets killed by Slurm at the same time that the scheduler gets killed? (Or does the fact that the background task jobs are detached from the scheduler somehow hide them from Slurm??)

Well that's true, if by "blocking" you mean waiting for task jobs to finish before the scheduler shuts down.

cylc stop (no options) is blocking though; and cylc stop --kill (which I recommended above) is also blocking, I think.

Good that you’ve found a solution, but it seems to me that watching the contact file shouldn’t be necessary if you use stop --kill.

By job, I was referring to the Slurm job.
As you mentioned, the Cylc task jobs do seem to be somewhat hidden from Slurm.

By “blocking,” I mean that the process waits for the task to finish before returning control to the prompt.

We don't want to use cylc stop alone, because it would wait for the Cylc task job to complete, and that could take hours.
We don't want to use cylc stop --kill either, because we want the Cylc task job to continue running, so that it can later be picked up by a new Slurm job.

What we want is for cylc stop --now to complete before the Slurm job ends.

I hope that makes things clearer.

Chloé
