Cylc8 migration issues

Hello, I’m finally getting round to migrating to cylc8, but so far I haven’t been able to get even the simplest tasks working.

The documentation is rather sparse on troubleshooting and also makes it hard to figure out simple tasks like how to set up a platform configuration. So I resorted to guesswork and put this in my ~/.cylc/flow/global.cylc:

[platforms]
    [[cluster-slurm]]
        hosts = metoc-cl4
        job runner = slurm
        retrieve job logs = True
        install target = localhost

    [[cluster-background]]
        hosts = metoc-cl4
        job runner = background
        retrieve job logs = True
        install target = localhost

I basically have two types of tasks, those that use slurm (to run on the compute nodes) and those tiny shell scripts that run on the headnode.

I can’t even get to the point where the task is executed, let alone submitted using slurm.

With some guesswork I pieced together the installation (documentation is badly lacking here), installed a cylc wrapper script using cylc get-resources cylc /usr/local/bin/, and then linked /usr/local/bin/rose to point to the cylc script. (That bit is missing in the documentation.) So now it finds the cylc and rose commands when it runs on the headnode.

I ran cylc vip suitename and inspected the log directory on the cluster headnode. The documentation says I should be inspecting the “workflow log” but there’s no such thing (I guess it’s been renamed, and the docs haven’t kept up to point out where to find this).

The job.err for a submit-failed job (using platform cluster-background) says:

/home/model/cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/build_uscc_engine/02/job: line 89: /home/model/cylc-run/downloader_suite_cl4/run8/.service/etc/job.sh: No such file or directory
/home/model/cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/build_uscc_engine/02/job: line 90: cylc__job__main: command not found

which is odd, it complains about not finding something I have no control over.

BTW, I had it past this point previously but forgot what I did, so cannot replicate it.

The log directory for a job using platform cluster-slurm only has the job and job.status files.
The job.status of this job simply reads:

CYLC_JOB_RUNNER_NAME=slurm

and no other log files are created.

I then inspected the cylc cat-log output on the cylc host, and see this:

2023-08-24T21:05:36Z INFO - Scheduler: url=tcp://cylcsrvr2:43015 pid=2900868
2023-08-24T21:05:36Z INFO - Workflow publisher: url=tcp://cylcsrvr2:43007
2023-08-24T21:05:36Z INFO - Run: (re)start number=1, log rollover=3
2023-08-24T21:05:36Z INFO - Cylc version: 8.2.0
2023-08-24T21:05:36Z INFO - Run mode: live
2023-08-24T21:05:36Z INFO - Initial point: 20230818T0000Z
2023-08-24T21:05:36Z INFO - Final point: 30000101T0000Z
2023-08-24T21:11:27Z CRITICAL - [20230822T0000Z/get_cmems_medwam_analysis_00 preparing job:10 flows:1] submission failed
2023-08-24T21:11:27Z INFO - [20230822T0000Z/get_cmems_medwam_analysis_00 preparing job:10 flows:1] => waiting
2023-08-24T21:11:27Z WARNING - [20230822T0000Z/get_cmems_medwam_analysis_00 waiting job:10 flows:1] retrying in PT30S (after 2023-08-24T21:11:57Z)
2023-08-24T21:11:27Z ERROR - [jobs-submit cmd] ssh -oBatchMode=yes -oConnectTimeout=10 metoc-cl4 env CYLC_VERSION=8.2.0 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-submit --debug --utc-mode --remote-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/downloader_suite_cl4/run8/log/job' 20230822T0000Z/get_cmems_medwam_analysis_00/10 20230822T0000Z/get_cmems_medwam_forecast_00/10
    [jobs-submit ret_code] 1
    [jobs-submit out] 2023-08-24T21:11:27Z|20230822T0000Z/get_cmems_medwam_forecast_00/10|1
2023-08-24T21:11:27Z DEBUG - [20230822T0000Z/get_cmems_medwam_forecast_00 preparing job:10 flows:1] (internal)submission failed at 2023-08-24T21:11:27Z
2023-08-24T21:11:27Z CRITICAL - [20230822T0000Z/get_cmems_medwam_forecast_00 preparing job:10 flows:1] submission failed

I don’t understand what’s going on.

Maybe someone can suggest a step-by-step procedure to debug this. I can’t see much that’s useful in the logs, so I’m really stuck.

Any help is much appreciated

Thanks,
Fred

A comment on the installation documentation - it mentions that conda cannot be trusted these days and one should use mamba. That works great but I need to modify the definition of CYLC_HOME_ROOT_ALT in the cylc wrapper script.

My installation procedure is now this:

# << as root >>
#-------------------------------------------------------------------------
# install cylc for root user, so we can create the /usr/local/bin/cylc command
#-------------------------------------------------------------------------

mkdir -p ~/Downloads/mamba
cd ~/Downloads/mamba

wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh

{ echo ''; echo yes; echo ''; echo yes; } | bash Mambaforge-Linux-x86_64.sh

# now re-open the shell, or type this:
source ~/.bashrc

yes | mamba install cylc-flow cylc-uiserver cylc-rose metomi-rose

# install the cylc wrapper script in a directory that's in the PATH by default
cylc get-resources cylc /usr/local/bin/
# edit the cylc wrapper script to look for its environment in the user's mambaforge environment (not miniconda)
sed -i 's/# CYLC_HOME_ROOT_ALT=\${HOME}\/miniconda3\/envs/CYLC_HOME_ROOT_ALT=\${HOME}\/mambaforge\/envs/g' /usr/local/bin/cylc

chmod +x /usr/local/bin/cylc
ln -sf /usr/local/bin/cylc /usr/local/bin/rose

I then repeat the same (up to mamba install) for the user model (who runs the suites), so this user also has the mamba environment to run the cylc command. I install cylc into the base environment.

Have I done this right?

My suite.rc is still largely unchanged from cylc7 and runs in compatibility mode. I’m assuming the graph isn’t the issue causing a submit-fail, so I’ll report a few parts under the [runtime] section that might be relevant:

[runtime]
    [[root]]
        script = rose task-run --verbose
        env-script = $( eval rose task-env )
        platform = cluster-slurm

        [[[environment]]]
            CFG_BUILD_BASE_DIR = ${ROSE_SUITE_DIR}/share
            ...

    [[HPC_SYSTEM]]
        pre-script = """
                ulimit -s unlimited
                module purge
                echo "module load {{ HPC_MODULES }}"
                module load {{ HPC_MODULES }} > /dev/null 2>&1
                module list 2>&1
                echo "sbatch=$(which sbatch)"
                echo "at=$(which at)"
                echo "HOSTNAME=$(hostname)"
                echo "PWD=$(pwd)"
                set +u  # avoid 'error PS1: unbound variable' in activate
                echo "mamba activate downloader_suite_cl4"
                source ~/mambaforge/bin/activate downloader_suite_cl4 &> ~/t
                RET=${?}
                if [[ ${RET} -ne 0 ]]; then
                    (>&2 echo "[ERROR] failed to activate downloader_suite_cl4 mamba environment")
                    exit ${RET}
                fi
                echo "mamba env downloader_suite_cl4 loaded"
            """
        platform = cluster-slurm
        [[[job]]]
            submission retry delays = 10*PT30S
            execution time limit = PT160S

    [[BACKGROUND_JOB]]
        inherit = HPC_SYSTEM
        platform = cluster-background

    [[PARALLEL_JOB]]
        inherit = HPC_SYSTEM
        [[[directives]]]
            --partition = defq
            --nodes = 1

    [[build_uscc_engine]]
        inherit = None, BACKGROUND_JOB
        [[[environment]]]
            ROSE_TASK_APP = build_uscc_engine

    [[get_cmems_mfwam_analysis_00]]
        inherit = None, PARALLEL_JOB
        [[[environment]]]
            ROSE_TASK_APP = cmems_download


That’s roughly how my tasks are set up. Some are BACKGROUND_JOBs, some are PARALLEL_JOBs.

But right now cylc falls over before it even gets that far.

Any help is appreciated. Thanks, Fred

Fair point, we should have a troubleshooting section, especially after all the Cylc 8 changes. We do have a doc issue up for that, but it hasn’t been addressed yet.

hard to figure out simple tasks like how to set up a platform configuration.

Have you seen these sections?

Please let us know if you think that documentation is lacking.

I resorted to guesswork and put this in my ~/.cylc/flow/global.cylc

That looks fine.

With some guesswork I pieced together the installation (documentation is badly lacking here)

Can you describe what you think is lacking?

and then linked /usr/local/bin/rose to point to the cylc script. (That bit is missing in the documentation.)

Strictly speaking, Cylc does not “know about” Rose (although we do provide a plugin for cylc install to handle rose-suite.conf files). We’ll consider adding that to the Cylc docs though, since we do mention Rose in several places).

inspected the log directory on the cluster headnode. The documentation says I should be inspecting the “workflow log” but there’s no such thing (I guess it’s been renamed, and the docs haven’t kept up to point out where to find this).

In some contexts the terms “workflow” and “scheduler” are interchangeable because in Cylc a single scheduler instance runs a single workflow. Maybe that’s not clearly stated in the docs though. So, “workflow log” is the scheduler log file written by the scheduler running that workflow. The log directory contains:

<run-dir>/log/scheduler/...  # scheduler aka workflow logs
<run-dir>/log/job/...  # job logs
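
For example, rather than digging around in the run directory, you should be able to print the scheduler log directly with something like this (using your workflow ID):

$ cylc cat-log downloader_suite_cl4   # the scheduler (workflow) log is the default file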

(I’ll return shortly to address other issues…)

Your workflow (/scheduler!) log shows that the first tasks in the workflow failed job submission.

That log file records task-related events that are either generated internally (i.e., inside the scheduler) or result from external interaction with the scheduler.

For job execution failure, the scheduler sees that the job failed, but not exactly why it failed - for that, you need to look at the job.err and job.out job logs.

Similarly, for job submission failure, the scheduler sees that job submission failed, but to see exactly why it failed you need to look at the job-activity.log job log (which is for job-related external activity either side of actual job execution).
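
For example, something like this should show the job-activity log for one of the failed submissions in your scheduler log (a sketch; the cycle point and task name are copied from your log excerpt):

$ cylc cat-log -f a downloader_suite_cl4//20230822T0000Z/get_cmems_medwam_analysis_00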

Weirdly, the fact that your job.err file is not empty implies that job submission did not actually fail (I wonder if you’re showing two different problems here?). However, job.err shows that the boilerplate job code (in job.sh) is not where it should be, so we need to figure that out…

First, I’d suggest debugging your installation with a minimal workflow:

[scheduling]
    [[graph]]
        R1 = foo
[runtime]
    [[foo]]
        script = "echo Hello World!"

That should run the one task foo as a local background job.

Then add platform = ... to check that your platform definitions are good.
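
For example, a variant of the minimal workflow that exercises one of your remote platforms might look like this (a sketch, reusing the cluster-background platform name from your global.cylc):

[scheduling]
    [[graph]]
        R1 = foo
[runtime]
    [[foo]]
        script = "echo Hello World!"
        platform = cluster-background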

When you cylc install this, the created run directory should look like this:

$ tree ~/cylc-run/fred
/home/oliverh/cylc-run/fred
├── _cylc-install
│   └── source -> /home/oliverh/cylc-src/fred
├── run1
│   ├── flow.cylc
│   └── log
│       └── install
│           └── 01-install.log
└── runN -> run1

Does it?

Then, we’ll take a look at running it…

Thanks Hilary!

I created the test-suite with the simple job (in a suite.rc for now) and ran it:

(cylc) [model@cylcsrvr2 test-suite]$ cylc vip test-suite
$ cylc validate /home/model/cylc-src/test-suite
WARNING - Backward compatibility mode ON
Valid for cylc-8.2.0
$ cylc install /home/model/cylc-src/test-suite
INSTALLED test-suite/run2 from /home/model/cylc-src/test-suite
$ cylc play test-suite/run2

 ▪ ■  Cylc Workflow Engine 8.2.0
 ██   Copyright (C) 2008-2023 NIWA
▝▘    & British Crown (Met Office) & Contributors

2023-08-25T10:39:26+04:00 WARNING - Backward compatibility mode ON
2023-08-25T10:39:26+04:00 INFO - Extracting job.sh to /home/model/cylc-run/test-suite/run2/.service/etc/job.sh
test-suite/run2: cylcsrvr2 PID=3056206

Then I inspected the logfiles on the cylc server:

(cylc) [model@cylcsrvr2 test-suite]$ m /home/model/cylc-run/test-suite/run2/log/job/1/foo/NN/job.*
::::::::::::::
/home/model/cylc-run/test-suite/run2/log/job/1/foo/NN/job.err
::::::::::::::
Traceback (most recent call last):
  File "/opt/cylc/cylc-flow-7.9.1/bin/cylc-message", line 140, in <module>
    main()
  File "/opt/cylc/cylc-flow-7.9.1/bin/cylc-message", line 136, in main
    return record_messages(suite, task_job, messages)
  File "/opt/cylc/cylc-flow-7.9.1/lib/cylc/task_message.py", line 82, in record_messages
    'messages': messages})
  File "/opt/cylc/cylc-flow-7.9.1/lib/cylc/network/httpclient.py", line 273, in put_messages
    func_name = self._compat('put_messages')
  File "/opt/cylc/cylc-flow-7.9.1/lib/cylc/network/httpclient.py", line 176, in _compat
    self._load_contact_info()
  File "/opt/cylc/cylc-flow-7.9.1/lib/cylc/network/httpclient.py", line 646, in _load_contact_info
    self.port = int(self.comms1.get(self.srv_files_mgr.KEY_PORT))
TypeError: int() argument must be a string or a number, not 'NoneType'
Traceback (most recent call last):
  File "/opt/cylc/cylc-flow-7.9.1/bin/cylc-message", line 140, in <module>
    main()
  File "/opt/cylc/cylc-flow-7.9.1/bin/cylc-message", line 136, in main
    return record_messages(suite, task_job, messages)
  File "/opt/cylc/cylc-flow-7.9.1/lib/cylc/task_message.py", line 82, in record_messages
    'messages': messages})
  File "/opt/cylc/cylc-flow-7.9.1/lib/cylc/network/httpclient.py", line 273, in put_messages
    func_name = self._compat('put_messages')
  File "/opt/cylc/cylc-flow-7.9.1/lib/cylc/network/httpclient.py", line 176, in _compat
    self._load_contact_info()
  File "/opt/cylc/cylc-flow-7.9.1/lib/cylc/network/httpclient.py", line 646, in _load_contact_info
    self.port = int(self.comms1.get(self.srv_files_mgr.KEY_PORT))
TypeError: int() argument must be a string or a number, not 'NoneType'
::::::::::::::
/home/model/cylc-run/test-suite/run2/log/job/1/foo/NN/job.out
::::::::::::::
Workflow : test-suite/run2
Job : 1/foo/01 (try 1)
User@Host: model@cylcsrvr2

Hello World!
2023-08-25T10:39:29+04:00 INFO - started
2023-08-25T10:39:29+04:00 INFO - succeeded
::::::::::::::
/home/model/cylc-run/test-suite/run2/log/job/1/foo/NN/job.status
::::::::::::::
CYLC_JOB_RUNNER_NAME=background
CYLC_JOB_ID=3056688
CYLC_JOB_RUNNER_SUBMIT_TIME=2023-08-25T10:39:27+04:00
CYLC_JOB_PID=3056688
CYLC_JOB_INIT_TIME=2023-08-25T10:39:29+04:00
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2023-08-25T10:39:29+04:00

So the problem here is that cylc8, started from the mamba environment, runs bits of the cylc7 installation - even though my PATH is fine when I’m in the cylc environment.

(cylc) [model@cylcsrvr2 test-suite]$ which cylc
~/mambaforge/envs/cylc/bin/cylc

I guess this is because the bashrc is set up to NOT activate the cylc environment by default. It cannot do that as it would ruin the cylc7 suites that are still running (until the migration is complete).

This seems to be a new issue, as I haven’t seen this in my earlier tests described above.

Do you have an idea on how to resolve this? Can you not have cylc7 and cylc8 running at the same time on the same cylc host? I really don’t want to have to set up a separate cylc host just for cylc8 to find its paths.

Could I get round this with a new user account?

NOTE: I have installed the cylc wrapper script (from cylc get-resources …) only on the job host, not on the cylc host - because on the cylc host I need cylc7 and cylc8 to run side by side. The job host (the slurm cluster) is a new machine, so doesn’t need cylc7 on it, but on the cylc host I’m sharing them right now.

Maybe I could set up a new user account specifically for cylc8 suites, and have it set up so it sets CYLC_VERSION and/or CYLC_HOME_ROOT differently for each user. So my question is:
Can the new cylc8 wrapper script /usr/local/bin/cylc also be used for cylc7 installations, so I can run things side by side?

Thanks, Fred

Yes, the new wrapper script supports both Cylc 7 & 8

@fredw - yep, that would do it.

The job.out file shows your Cylc 8 test case successfully submits and runs the job. But the job.err file shows that the job can’t communicate its status back to the scheduler because the job environment is picking up Cylc 7 instead of Cylc 8.

If you need to run multiple versions of Cylc (potentially including multiple 7 and 8 variants):

  • you need the cylc wrapper in your default executable PATH
  • modify the wrapper (see its internal comments) to point to your local installation locations
  • make sure your login scripts do not hardwire variables like CYLC_VERSION and CYLC_HOME. The wrapper uses these - when invoked in job environments - to select the same Cylc version as the parent scheduler (a quick check is sketched below).
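
For example, a quick way to check for hardwired settings (a sketch; adjust for the startup files you actually use):

$ grep -nE 'CYLC_VERSION|CYLC_HOME' ~/.bashrc ~/.bash_profile ~/.profile 2>/dev/null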

By the way, contrary to a statement above, turns out our “Advanced Installation” documentation (which briefly describes the wrapper) does mention the rose symlink!

https://cylc.github.io/cylc-doc/stable/html/installation.html#advanced-installation

under “managing environments”:

https://cylc.github.io/cylc-doc/stable/html/installation.html#managing-environments

Hi, you might want to see this section of the docs which contains example platform configuration patterns and adapt these to your needs:

https://cylc.github.io/cylc-doc/stable/html/reference/config/writing-platform-configs.html

Just went back and spotted this comment. I’m not 100% sure I understand what you’ve done here, but it doesn’t seem right at face value.

You should install each new version of Cylc into its own virtual environment - if using Conda (with or without the mamba solver) that means into its own conda environment. The environment should be named according to the Cylc version. You want to end up with a conda environment that packages Cylc with all of its dependencies. The wrapper script can then use that environment when it needs to invoke that version of Cylc.

If you try to install Cylc into your “base” conda environment, or some other general environment, you’re much more likely to run into environment conflicts, and you won’t be able to install future Cylc releases alongside it. Cylc is, and is used as, an application. There’s no reason that it needs to be installed into a general conda environment with loads of other stuff in it.
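
For example (a sketch; the package list is copied from your own install commands, and the version pin is illustrative):

$ mamba create -n cylc-8.2.1 cylc-flow=8.2.1 cylc-uiserver cylc-rose metomi-rose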

I guess this is because the bashrc is set up to NOT activate the cylc environment by default

Note, you should not activate the Cylc environment in the wrapper script (this will change the environment of local jobs). Cylc will work without explicit environment activation.

Do you have an idea on how to resolve this? Can you not have cylc7 and cylc8 running at the same time on the same cylc host?

We have multiple parallel installations of Cylc 7 and Cylc 8 side-by-side, we use the wrapper script to switch between them:

$ CYLC_VERSION=7 cylc version 
7.8.13
$ CYLC_VERSION=8 cylc version
8.2.1

You should be able to use both versions on the same host with the same user account.

The issues you’re seeing are a deployment problem.

I guess this is because the bashrc is set up to NOT activate the cylc environment by default

One way this could have happened is if the Cylc 7 installation is present in the PYTHONPATH.
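
For example, a couple of quick checks (a sketch; the Cylc 7 path comes from the traceback in your job.err above):

$ echo "$PYTHONPATH"                            # should not mention /opt/cylc/cylc-flow-7.9.1
$ bash -lc 'echo "$PYTHONPATH"; which -a cylc'  # what a fresh login shell picks up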

Thanks everyone,

I spent hours trying this out but I can’t get this to work. I am still not clear whether cylc is to be installed into a conda env named cylc-8.2.1 or named just cylc.
I tried to include the version number in the name of the conda env and this didn’t work with the /usr/local/bin/cylc wrapper script. It was obviously looking for an environment called cylc and then couldn’t find what it was looking for.

So I went back to installing everything into a conda env called cylc on the cylc host and the job host.

Note that on the cylc host I can’t modify the /usr/local/bin/cylc wrapper script - the cylc host is a production environment for the older cylc7 suites, so I can’t muck around with those as I might stop those production suites from working.

In order to separate the cylc installations I created a new user called cylc on both cylc host and job host, installed the cylc conda env with:

yes | mamba create -n cylc cylc-flow cylc-uiserver cylc-rose metomi-rose

on both machines

I can now run this on the cylc host:

(cylc) [cylc@cylcsrvr2 downloader_suite_cl4]$ ssh cylc@metoc-cl4 cylc version
8.2.1
(cylc) [cylc@cylcsrvr2 downloader_suite_cl4]$ ssh cylc@metoc-cl4 which cylc
/usr/local/bin/cylc

NOTE: I had to modify the wrapper script on the job host to be able to find cylc in the mamba env:

# install the cylc wrapper script in a directory that's in the PATH by default
cylc get-resources cylc /usr/local/bin/
# edit the cylc wrapper script to look for its environment in the user's mambaforge environment (not miniconda)
sed -i 's/# CYLC_HOME_ROOT_ALT=\${HOME}\/miniconda3\/envs/CYLC_HOME_ROOT_ALT=\${HOME}\/mambaforge\/envs/g' /usr/local/bin/cylc

I don’t understand where /opt comes in here - is this a compatibility option for cylc7?

Anyway, I gave up on the idea of running them side by side; it’s just too risky that cylc8 destroys the delicate environment of cylc7, which is a production system, so I can’t really change that and simply hope that it’ll work. My terrible experience with cylc8 so far gives me no such hope.

Once I had reverted everything to conda envs called cylc I tried to run the suites again. The simple test-suite suggested above worked. But that doesn’t actually do anything - it doesn’t really have any jobs that run on the job host - so I tried my “real” suite again.

(cylc) [cylc@cylcsrvr2 downloader_suite_cl4]$ cylc vip --debug downloader_suite_cl4
2023-08-29T03:07:09+04:00 DEBUG - Loading site/user config files
2023-08-29T03:07:09+04:00 DEBUG - Reading file /home/cylc/.cylc/flow/global.cylc
$ cylc validate /home/cylc/cylc-src/downloader_suite_cl4
2023-08-29T03:07:09+04:00 WARNING - Backward compatibility mode ON
2023-08-29T03:07:09+04:00 DEBUG - Reading file /home/cylc/cylc-src/downloader_suite_cl4/suite.rc
2023-08-29T03:07:09+04:00 WARNING - 'rose-suite.conf[jinja2:suite.rc]' is deprecated. Use [template variables] instead.
2023-08-29T03:07:09+04:00 DEBUG - Processing with Jinja2
2023-08-29T03:07:09+04:00 DEBUG - Setting Jinja2 template variables:
 
...

 ▪ ■  Cylc Workflow Engine 8.2.1
 ██   Copyright (C) 2008-2023 NIWA
▝▘    & British Crown (Met Office) & Contributors

2023-08-28T23:07:09Z WARNING - Backward compatibility mode ON
2023-08-28T23:07:09Z DEBUG - /home/cylc/cylc-run/downloader_suite_cl4/run6/log/scheduler: directory created
2023-08-28T23:07:09Z DEBUG - /home/cylc/cylc-run/downloader_suite_cl4/run6/log/job: directory created
2023-08-28T23:07:09Z DEBUG - /home/cylc/cylc-run/downloader_suite_cl4/run6/share: directory created
2023-08-28T23:07:09Z DEBUG - /home/cylc/cylc-run/downloader_suite_cl4/run6/work: directory created
2023-08-28T23:07:09Z INFO - Extracting job.sh to /home/cylc/cylc-run/downloader_suite_cl4/run6/.service/etc/job.sh
downloader_suite_cl4/run6: cylcsrvr2 PID=3983334

So the suite starts, and it does something on the job host. Here’s the job.err of an R1 task that is supposed to run on the headnode (no slurm submit):

(cylc) cylc@metoc-cl4:~/cylc-run/downloader_suite_cl4/run6/log/job/20230818T0000Z/build_uscc_engine/01$ cat job.err
/home/cylc/cylc-run/downloader_suite_cl4/run6/log/job/20230818T0000Z/build_uscc_engine/01/job: line 89: /home/cylc/cylc-run/downloader_suite_cl4/run6/.service/etc/job.sh: No such file or directory
/home/cylc/cylc-run/downloader_suite_cl4/run6/log/job/20230818T0000Z/build_uscc_engine/01/job: line 90: cylc__job__main: command not found

So it didn’t get far: it couldn’t even find the script it should have installed itself. The file /home/cylc/cylc-run/downloader_suite_cl4/run6/.service/etc/job.sh wasn’t there, because the run6 directory only contains a log directory and nothing else:

ll /home/cylc/cylc-run/downloader_suite_cl4/run6
total 4
drwxrwxr-x 3 cylc cylc   25 Aug 29 03:07 ./
drwxrwxr-x 5 cylc cylc 4096 Aug 29 03:07 ../
drwxrwxr-x 3 cylc cylc   25 Aug 29 03:07 log/

I am totally lost. What am I doing wrong?
Can you please help?

Thanks,
Fred

You should use the version number, if you want to be able to install future releases alongside this one. However, what matters is that the cylc wrapper knows how to find it according to the environment variables that it looks at.

The wrapper is not expected to work out of the box; you need to modify it to suit your site - specifically, the location of your installed Cylc versions. [Looks like you have done that below.]

I can’t see /opt anywhere in your transcript. Presumably you mean the one place it appears in the freshly extracted wrapper. That is merely an example location, hence the preceding comment:

##############################!!! EDIT ME !!!##################################
# Centrally installed Cylc releases:
CYLC_HOME_ROOT="${CYLC_HOME_ROOT:-/opt}"

The idea is, it should default to the location under which you install Cylc versions, but the syntax allows users (most likely Cylc developers) to override that by setting CYLC_HOME_ROOT themselves.
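
For example, for your layout the relevant lines might end up looking something like this (a sketch only, consistent with the sed edit in your install notes; adjust the paths to wherever your Cylc environments actually live):

# Centrally installed Cylc releases (hypothetical central path):
CYLC_HOME_ROOT="${CYLC_HOME_ROOT:-/opt/cylc-envs}"
# Users' own conda/mamba environments (as in your sed edit):
CYLC_HOME_ROOT_ALT="${HOME}/mambaforge/envs"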

Get the new Cylc 8 wrapper working for both Cylc 7 and Cylc 8 on your system, then (once fully tested!) replace the old wrapper with the new. During testing, just make sure the new wrapper (e.g. in $HOME/bin) appears first in your $PATH.

Sorry for the pain! You can rest assured this is just a deployment issue. Once you have it working it will stay working for current and future installed Cylc versions. That said, by all means report back exactly what you think would make the documentation better.

Yes the simple test suite only runs a local background job. That’s the zero-th order check that your new installation works - and in fact it tests pretty much every aspect of Cylc apart from remote job interaction. Once that works, modify it slightly to launch the job on a remote host.

We’ll figure it out. Is the “head node” remote, with respect to where you’re running Cylc?

If so, does it see the same shared filesystem? If it doesn’t see the same FS, does your job platform definition specify an appropriate “install target” for installation of workflow files?

Here’s a simple example that runs one local background task, followed by one remote background task on a platform that does not see the same filesystem:

# flow.cylc
[scheduling]
    [[graph]]
        R1 = "foo => bar"
[runtime]
   [[root]]
      script = "echo Hello World"
   [[foo]]  # local task
   [[bar]]  # remote task
      platform = my_remote

The remote job platform definition:

# global.cylc
[platforms]
    [[my_remote]]
        hosts = head1
        install target = head1  # (if omitted, this defaults to the platform name)

On running this:

$ cylc vip --no-detach --no-timestamp 

-> $ cylc validate /home/oliverh/cylc-src/gug
Valid for cylc-8.3.0.dev
-> $ cylc install /home/oliverh/cylc-src/gug
INSTALLED gug/run4 from /home/oliverh/cylc-src/gug
-> $ cylc play --no-detach --no-timestamp gug/run4

 ▪ ■  Cylc Workflow Engine 8.3.0.dev
 ██   Copyright (C) 2008-2023 NIWA
▝▘    & British Crown (Met Office) & Contributors

INFO - Extracting job.sh to /home/oliverh/cylc-run/gug/run4/.service/etc/job.sh
INFO - Workflow: gug/run4
INFO - Scheduler: url=tcp://NIWA-1022450.niwa.local:43026 pid=6938
INFO - Workflow publisher: url=tcp://NIWA-1022450.niwa.local:43044
INFO - Run: (re)start number=1, log rollover=1
INFO - Cylc version: 8.3.0.dev
INFO - Run mode: live
INFO - Initial point: 1
INFO - Final point: 1
INFO - Cold start from 1
INFO - New flow: 1 (original flow from 1) 2023-08-29 15:17:07
INFO - [1/foo waiting(runahead) job:00 flows:1] => waiting
INFO - [1/foo waiting job:00 flows:1] => waiting(queued)
INFO - [1/foo waiting(queued) job:00 flows:1] => waiting
INFO - [1/foo waiting job:01 flows:1] => preparing
INFO - [1/foo preparing job:01 flows:1] submitted to localhost:background[6958]
INFO - [1/foo preparing job:01 flows:1] => submitted
INFO - [1/foo submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
INFO - [1/foo submitted job:01 flows:1] => running
INFO - [1/foo running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
INFO - [1/foo running job:01 flows:1] => succeeded
INFO - [1/bar waiting(runahead) job:00 flows:1] => waiting
INFO - [1/bar waiting job:00 flows:1] => waiting(queued)
INFO - [1/bar waiting(queued) job:00 flows:1] => waiting
INFO - [1/bar waiting job:01 flows:1] => preparing
>>> INFO - platform: my_remote - remote init (on head1)
>>> INFO - platform: my_remote - remote file install (on head1)
>>> INFO - platform: my_remote - remote file install complete
INFO - [1/bar preparing job:01 flows:1] submitted to my_remote:background[386]
INFO - [1/bar preparing job:01 flows:1] => submitted
INFO - [1/bar submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
INFO - [1/bar submitted job:01 flows:1] => running
INFO - [1/bar running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
INFO - [1/bar running job:01 flows:1] => succeeded
INFO - Workflow shutting down - AUTOMATIC
INFO - platform: my_remote - remote tidy (on head1)
INFO - DONE

In the transcript above I’ve highlighted remote platform initialization with >>>. This should happen the first time the scheduler runs a task on the remote, if it has a different install target (i.e., if it does not see the same filesystem as the scheduler).

If I screw up the platform definition by adding install target = localhost, then:

  • I will not see the remote initialization lines in the scheduler log
  • the remote task will appear to hang in the scheduler - the job can’t communicate its status back because the remote platform wasn’t initialized for the workflow
  • the remote job error log shows:
$ cylc log -f e gug//1/bar
/other/cylc-run/gug/run2/log/job/1/bar/01/job: line 45: /root/cylc-run/gug/run2/.service/etc/job.sh: No such file or directory
/other/cylc-run/gug/run2/log/job/1/bar/01/job: line 46: cylc__job__main: command not found

Which looks suspiciously like your problem!

Hello Hilary,

Thanks for your comments, there’s some gold in there.

I made a change to my global.cylc file and made some progress. The content and creation of this file should be explained better (with examples) in the documentation.
I’ve just realised from your comments what the install target means. So I now have this file:

[platforms]
    # The localhost platform is available by default
    # [[localhost]]
    #     hosts = localhost
    #     install target = localhost
    [[cluster-slurm]]
        hosts = metoc-cl4
        job runner = slurm
        retrieve job logs = True
        install target = cluster

    [[cluster-at]]
        hosts = metoc-cl4
        job runner = at
        install target = cluster


    [[cluster-background]]
        hosts = metoc-cl4
        job runner = background
        retrieve job logs = True
        install target = cluster

It was crucial to name the install target something other than localhost. My cluster doesn’t share a filesystem with the cylc host, so your explanation was spot on. Thanks!

A task in my “real” suite which runs on the headnode now works - this is a task which compiles executables used later on as part of the R1.

After that the slurm jobs kick in but currently fail. Here’s a snippet from the scheduler log:

2023-09-01T09:58:01Z DEBUG - [jobs-submit cmd] cat /home/cylc/cylc-run/downloader_suite_cl4/run7/log/job/20230818T0000Z/get_cmems_mfwam_analysis_00/05/job /home/cylc/cylc-run/downloader_suite_cl4/run7/log/job/20230818T0000Z/get_cmems_mfwam_forecast_00/05/job | ssh -oBatchMode=yes -oConnectTimeout=10 metoc-cl4 env CYLC_VERSION=8.2.1 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-submit --debug --utc-mode --remote-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/downloader_suite_cl4/run7/log/job' 20230818T0000Z/get_cmems_mfwam_analysis_00/05 20230818T0000Z/get_cmems_mfwam_forecast_00/05
    [jobs-submit ret_code] 0
    [jobs-submit out]
    [TASK JOB SUMMARY]2023-09-01T09:58:01Z|20230818T0000Z/get_cmems_mfwam_analysis_00/05|1|
    [TASK JOB COMMAND]2023-09-01T09:58:01Z|20230818T0000Z/get_cmems_mfwam_analysis_00/05|[STDERR] [Errno 2] No such file or directory: 'sbatch'
    [TASK JOB SUMMARY]2023-09-01T09:58:01Z|20230818T0000Z/get_cmems_mfwam_forecast_00/05|1|
    [TASK JOB COMMAND]2023-09-01T09:58:01Z|20230818T0000Z/get_cmems_mfwam_forecast_00/05|[STDERR] [Errno 2] No such file or directory: 'sbatch'
2023-09-01T09:58:01Z ERROR - [jobs-submit cmd] ssh -oBatchMode=yes -oConnectTimeout=10 metoc-cl4 env CYLC_VERSION=8.2.1 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-submit --debug --utc-mode --remote-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/downloader_suite_cl4/run7/log/job' 20230818T0000Z/get_cmems_mfwam_analysis_00/05 20230818T0000Z/get_cmems_mfwam_forecast_00/05
    [jobs-submit ret_code] 1
    [jobs-submit out] 2023-09-01T09:58:01Z|20230818T0000Z/get_cmems_mfwam_analysis_00/05|1|
2023-09-01T09:58:01Z DEBUG - [20230818T0000Z/get_cmems_mfwam_analysis_00 preparing job:05 flows:1] (internal)submission failed at 2023-09-01T09:58:01Z
2023-09-01T09:58:01Z CRITICAL - [20230818T0000Z/get_cmems_mfwam_analysis_00 preparing job:05 flows:1] submission failed
2023-09-01T09:58:01Z INFO - [20230818T0000Z/get_cmems_mfwam_analysis_00 preparing job:05 flows:1] => waiting

The crucial bit here is that the sbatch command cannot be found.

In my interactive shell I run module load slurm in my .bashrc, but this doesn’t get executed in a remote non-interactive shell (where the module command isn’t available anyway).

Is there an accepted way of making the slurm commands available to a remote shell?

I can see how cylc solves this problem with a wrapper script in /usr/local/bin, but how could I solve this for other commands required to run parts of the suite?

I should note that in my case cylc is trying to run sbatch way before I get to anything in my pre-script or script runtimes, it’s run by cylc itself, and the shell it’s running in is at this point still in a virgin state where the module command is unavailable.

I tried the following. I edited the .bashrc for the cylc user on the headnode (the job host):

# ~/.bashrc: executed by bash(1) for non-login shells.
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples

module load slurm

# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac

...more stuff...

then I tried to run this on the cylc host:

$ ssh cylc@metoc-cl4 which sbatch
/home/cylc/.bashrc: line 5: module: command not found

and this confirms that in this non-interactive shell the module command is not yet available.

Do you have a recommendation for how to resolve this?
Thanks, Fred

I’ve made a temporary fix to try and get sbatch to run by changing the non-interactive part of my .bashrc:

# ~/.bashrc: executed by bash(1) for non-login shells.
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples

export PATH="/cm/shared/apps/slurm/current/bin/:$PATH"

# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac

By setting the PATH here, the non-interactive shell run by cylc can now find slurm and execute the sbatch command.

The slurm jobs still don’t run, so I tried doing it by hand.

So, on the headnode (the job host) I look at the log directory for the job, and only see 2 files:

$ ll ~/cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/get_cmems_med_forecast/NN/
total 8
drwxrwxr-x 2 cylc cylc   47 Sep  1 14:25 ./
drwxrwxr-x 8 cylc cylc  108 Sep  1 14:25 ../
-rwxrwxr-x 1 cylc cylc 3629 Sep  1 14:24 job*
-rw-rw-r-- 1 cylc cylc   27 Sep  1 14:25 job.status

the job.status contains nothing useful, just this:

CYLC_JOB_RUNNER_NAME=slurm

the job file contains the slurm script. This is where I noticed an odd thing amongst the directives:

# DIRECTIVES:
#SBATCH --job-name=get_cmems_med_forecast.20230818T0000Z.downloader_suite_cl4/run8
#SBATCH --output=cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/get_cmems_med_forecast/06/job.out
#SBATCH --error=cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/get_cmems_med_forecast/06/job.err
#SBATCH --time=150:00
#SBATCH --partition=defq
#SBATCH --ntasks=1

The file path of the out/err files is a relative path starting with cylc-run/..., which is odd. Relative to what? Shouldn’t this be an absolute path?

When I run through cylc I don’t see any job.out or job.err in the NN log directory, because I think it’s lost in the woods of a file path that doesn’t exist.

I can run it manually from the home directory:

$ cd $HOME
$ sbatch ~/cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/get_cmems_med_forecast/NN/job
Submitted batch job 13
$ ll ~/cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/get_cmems_med_forecast/NN/
total 84
drwxrwxr-x 2 cylc cylc   107 Sep  1 14:51 ./
drwxrwxr-x 8 cylc cylc   108 Sep  1 14:25 ../
-rwxrwxr-x 1 cylc cylc  3629 Sep  1 14:24 job*
-rw-rw-r-- 1 cylc cylc  1735 Sep  1 14:51 job.err
-rw-rw-r-- 1 cylc cylc  1150 Sep  1 14:51 job.out
-rw-rw-r-- 1 cylc cylc   145 Sep  1 14:51 job.status
-rw-rw-r-- 1 cylc cylc 67130 Sep  1 14:51 job.xtrace

and now I see the out/err files in their directory.

How do I fix the relative pathnames to be absolute ones, so I see the out/err files when cylc runs it?

But at least I can now examine the error output and it looks like this:

$ cat /home/cylc/cylc-run/downloader_suite_cl4/run8/log/job/20230818T0000Z/get_cmems_med_forecast/NN/job.err
Sending DEBUG MODE xtrace to job.xtrace
2023-09-01T10:51:36Z DEBUG - zmq:send {'command': 'graphql', 'args': {'request_string': '\nmutation (\n  $wFlows: [WorkflowID]!,\n  $taskJob: String!,\n  $eventTime: String,\n  $messages: [[String]]\n) {\n message (\n    workflows: $wFlows,\n    taskJob: $taskJob,\n    eventTime: $eventTime,\n    messages: $messages\n  ) {\n    result\n  }\n}\n', 'variables': {'wFlows': ['downloader_suite_cl4/run8'], 'taskJob': '20230818T0000Z/get_cmems_med_forecast/06', 'eventTime': '2023-09-01T10:51:36Z', 'messages': [['INFO', 'started']]}}, 'meta': {'prog': 'message', 'host': 'node01', 'comms_method': 'zmq'}}
2023-09-01T10:51:36Z DEBUG - zmq:recv {'data': {'message': {'result': [{'id': '~cylc/downloader_suite_cl4/run8', 'response': [True, 'Messages queued: 1']}]}}, 'user': 'cylc'}
/cm/local/apps/slurm/var/spool/job00013/slurm_script: line 98: rose: command not found
2023-09-01T10:51:37Z CRITICAL - failed/ERR
2023-09-01T10:51:37Z DEBUG - zmq:send {'command': 'graphql', 'args': {'request_string': '\nmutation (\n  $wFlows: [WorkflowID]!,\n  $taskJob: String!,\n  $eventTime: String,\n  $messages: [[String]]\n) {\n message (\n    workflows: $wFlows,\n    taskJob: $taskJob,\n    eventTime: $eventTime,\n    messages: $messages\n  ) {\n    result\n  }\n}\n', 'variables': {'wFlows': ['downloader_suite_cl4/run8'], 'taskJob': '20230818T0000Z/get_cmems_med_forecast/06', 'eventTime': '2023-09-01T10:51:37Z', 'messages': [['CRITICAL', 'failed/ERR']]}}, 'meta': {'prog': 'message', 'host': 'node01', 'comms_method': 'zmq'}}
2023-09-01T10:51:37Z DEBUG - zmq:recv {'data': {'message': {'result': [{'id': '~cylc/downloader_suite_cl4/run8', 'response': [True, 'Messages queued: 1']}]}}, 'user': 'cylc'}

So it cannot find the rose command. Which is odd, because I created a soft-link on the job host from /usr/local/bin/rose to point to the cylc wrapper script.

Running on the cylc host:

[cylc@cylcsrvr2 ~]$ ssh cylc@metoc-cl4 which rose
/usr/local/bin/rose

So the command is there in a standard bin directory.

I can probably hack something so it’ll find rose, but the bug with the relative filenames seems more important at the moment. Can you help?

Thanks,
Fred

Hi Fred.

documentation is badly lacking here

The content and creation of this file should be explained better (with examples) in the documentation.

As referenced in the comments above, it is!

This section explains platforms, install targets and gives multiple examples covering all the major platform configuration patterns we are aware of:

https://cylc.github.io/cylc-doc/stable/html/reference/config/writing-platform-configs.html

And this section provides documentation for each of the configurations in the [platforms] section:

https://cylc.github.io/cylc-doc/stable/html/reference/config/global.html#global.cylc[platforms]

Please be aware that Cylc is an open source project, offered to you with support, free of charge. Feel free to contribute any documentation suggestions you might have to the cylc-doc project.

The file path of the out/err files is a relative path starting with cylc-run/..., which is odd. Relative to what? Shouldn’t this be an absolute path?

I don’t think this is incorrect; my slurm job scripts use relative paths too. Cylc should run this command from the $HOME directory. Are you changing directory in any of your shell startup files?

Please check whether the slurm submission created by Cylc was successful; if not, you should find evidence in the scheduler log (log/scheduler/log), the job-activity.log and the slurm DB.
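
For example, on the cluster something like this should show whether Slurm actually accepted the jobs (a sketch; the exact options depend on how Slurm accounting is set up at your site):

$ sacct --starttime now-2hours --format=JobID,JobName%40,State,ExitCode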

When I run through cylc I don’t see any job.out or job.err in the NN log directory, because I think it’s lost in the woods of a file path that doesn’t exist.

When these files are created comes down to the batch system, likely either when the job starts or after it finishes.

rose: command not found

The wrapper script symlink you’ve created is not showing up as an executable in $PATH. Check $PATH and the filesystem permissions; you might need to chmod +x rose.
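
For example, one way to see what a batch job actually has on its $PATH (a sketch; run as the cylc user on the head node, with a hypothetical output path):

$ sbatch --wrap 'echo "PATH=$PATH"; which rose cylc' --output="$HOME/path_check.out"
$ cat ~/path_check.out   # once the job has finished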

Note I did provide links to these doc sections in my first response above on this thread.

@fredw - I guess the question is, did you just not find this documentation, or did find it but you still think it is lacking (and in that case, what needs improving?)

Right, the default environment for non-interactive shells on your system does not include the module command, and hence Slurm’s sbatch.

So this is not a Cylc problem. You would have exactly the same issue if you tried to submit a bunch of jobs from a shell script on the host that you’re running Cylc on.

All you need to do is find what makes the module command available to login shells, and put that in your .bashrc. Offhand I’m not sure, but I have a feeling module might be a shell function sourced from /etc/profile or something like /etc/profile.d/modules rather than an executable at some PATH location.
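
For example, something along these lines in the non-interactive part of your .bashrc on the job host (a sketch; the location of the modules init script varies between systems):

if [ -f /etc/profile.d/modules.sh ]; then
    . /etc/profile.d/modules.sh   # defines the 'module' shell function
fi
module load slurm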

Then you’ll be good to go, whether with Cylc or any other means of automatic remote job submission.