[runtime][<task name>][job] batch system equivalent in Cylc 8.3.4

Hello,

I am trying to rewrite some old Cylc 7 suite.rc files into the 8.3.4 format and I am having trouble understanding how to define platforms and job runners. In Cylc 7, I had the following in an inc/platform.rc file that would be included in the suite.rc and inherited by certain tasks, to define whether a task should run in the background or on a compute node:

[runtime]
    [[BG_TASK]]
        [[[job]]]
            batch system = background

I went through the 8.x tutorial and see that [[[job]]] is gone now and has been replaced by (I think) the [platforms] section in $HOME/.cylc/flow/8.3.4/global.cylc. So I wrote the following in my global.cylc file:

[platforms]
    [[BG_TASK]]
        job runner = background

then in my flow.cylc file I have:

[runtime]
    [[BACKGROUND_QUEUE]]
        init-script = "set -x"
        platform = BG_TASK

which can then be inherited by tasks that I want to run in the background. The jobs won't submit. The only output I get from running the workflow is a job-activity.log file that says:

[jobs-submit cmd] (init BG_TASK)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED
[jobs-submit cmd] (remote init)
[jobs-submit ret_code] 1

My understanding was that you define custom named platforms in your global.cylc file which can then be referenced in your flow.cylc file using [runtime][]platform. Where did I go wrong here?

That's right. At a site with many Cylc users you'd expect platforms to be defined centrally, but if not you can do it in your user global.cylc file.

You'll get more information from the scheduler log. Run your workflow again with --no-detach, or use cylc cat-log <workflow-id> (or look in ~/cylc-run/<workflow-id>/log/scheduler/log).
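For example (with <workflow-id> as a placeholder for your installed workflow):

# run attached, so scheduler errors print straight to the terminal
cylc play --no-detach <workflow-id>

# or view the scheduler log after the fact
cylc cat-log <workflow-id>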

So: Cylc could not connect to your platform because "Unable to find valid host for BG_TASK". You did not list any hosts in your platform definition, so Cylc tried the default (host-name = platform-name), which didn't work.

Take a look at the platforms documentation here: Platform Configuration - Cylc 8.3.4 documentation

You need to define a list of hosts and an install target, as well as the job runner. The minimal platform definition for local background jobs (which, by the way, is the default if you don't specify a platform at all) is:

# global.cylc
[platforms]
    [[BG_TASK]]
        job runner = background
        hosts = localhost
        install target = localhost
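With that platform defined, your flow.cylc snippet should work as-is, and tasks just inherit the family - e.g. (the task name here is hypothetical):

# flow.cylc
[runtime]
    [[BACKGROUND_QUEUE]]
        platform = BG_TASK
    [[my_bg_task]]
        inherit = BACKGROUND_QUEUE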

Thanks for the response, that makes sense, and thank you for linking the documentation on platform configuration.

I am trying to think how to adapt this methodology to one of our suites, where we include one of a multitude of platform files (i.e. HPC1_include.rc, HPC2_include.rc, HPC3_include.rc) that the user has to choose from based on what machine they are running on. Each of these files will have a different version of, say, the job runner (i.e. PBS vs Slurm).

Now that the definition of these platforms is delegated to the global.cylc file, it seems like the user would have to know to change their platform specifications correctly in their global.cylc file before using our version-controlled Cylc workflow.

A workaround might be to use $CYLC_SITE_CONF_PATH and hardwire that in flow.cylc[runtime][root] to be $CYLC_WORKFLOW_RUN_DIR/etc. In $CYLC_WORKFLOW_RUN_DIR/etc, I would have something like HPC1.cylc, HPC2.cylc, etc., and prior to installing the workflow the user would symlink global.cylc to one of these files. Am I overthinking this?

You should not need to use CYLC_SITE_CONF_PATH for this.

You can define all the platforms in the same global.cylc, and just select the right ones as needed in workflow task definitions.

Platform definitions can overlap - e.g. you could define a platform to run background jobs on one particular host of several that also appear in a platform with PBS as the job runner (note that platform hosts are where Cylc interacts with the job runner, not the compute nodes managed by the job runner itself).

I think you could keep the same set of flow.cylc include files, but just change their content to set the right platform name for the task family.
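For example (the platform names, host names, and family name below are purely illustrative), the global.cylc could define both platforms centrally, and each include file would then only need to select one:

# global.cylc
[platforms]
    [[HPC1]]
        job runner = pbs
        hosts = hpc1-login
        install target = hpc1
    [[HPC2]]
        job runner = slurm
        hosts = hpc2-login
        install target = hpc2

# HPC1 include file, pulled into flow.cylc
[runtime]
    [[HPC_TASKS]]
        platform = HPC1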

Is there any way to do this without a global.cylc? I'd prefer everything needed to run the workflow to be self-contained in one git clone checkout, without the need for the user to go into their $HOME/.cylc/flow and change/create a file.

No, we don't support platform definitions inside a workflow configuration, because platforms are inherently not workflow-specific.

Ideally, normal users shouldn't even need to understand how to define platforms; they should just choose from the centrally-defined ones.

If that has not been done (i.e., no central definitions), you can define your own platforms, but the principle is the same - all of your workflows should select from the same platforms, no need to redefine them in every workflow.

Is it not feasible to have a central global.cylc for all users on your workflow scheduler host(s)?

Note there is likely to be other global config needed too, not just platforms.

Okay, that makes sense. It seems like we will have to figure out a way to create a centrally-defined global.cylc for each of our HPC systems, which all users can reference via the *_include.rc files included in the workflow checkout.

Our lab hasn't made the full transition to using Cylc 8, so there isn't much infrastructure yet. There is no central global.cylc yet, but this seems like the direction our development should move in as we progress with this transition.

Agreed!

Note you only need a global.cylc on Cylc scheduler hosts, not on job platforms.

Typically we have a small pool of scheduler hosts, but they're all on the same shared filesystem, so that's only one global config file.

Slight aside (handy if you are upgrading from Cylc 7) -

If you have your platforms set up correctly and you run a workflow in compatibility mode (i.e. use Cylc 8 on a workflow defined in a suite.rc), Cylc will select a platform from the old settings, if a matching one can be found.

Locally we recommend the following steps when upgrading Cylc 7 → Cylc 8 (see also the docs on upgrading); a rough command sketch follows the list:

  1. Use Cylc 7 cylc validate to ensure that there are no upgrade warnings relating to Cylc 6 ⇒ 7 upgrades. If there are, fix them.
  2. Use Cylc 8 cylc validate to fix anything compatibility mode cannot handle.
  3. Try installing and running (cylc vip) the workflow (this is compatibility mode) and check that it works.
  4. Move suite.rc ⇒ flow.cylc.
  5. Re-run cylc validate and cylc lint --ruleset 728, and fix any further warnings.
  6. Try running the workflow.
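Something like this (workflow and source path names are placeholders):

# step 1, with Cylc 7
cylc validate <suite-name>

# steps 2-3, with Cylc 8, still on suite.rc (compatibility mode)
cylc validate <path/to/source>
cylc vip <path/to/source>

# steps 5-6, after renaming suite.rc to flow.cylc
cylc validate <path/to/source>
cylc lint --ruleset 728 <path/to/source>
cylc vip <path/to/source>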

Can you clarify the difference between scheduler hosts and job platforms? Is the scheduler host for BG_TASK (as defined above in the previously discussed global.cylc) the localhost, and is the job platform equivalent to the job runner (i.e. 'background')?

Thanks for the steps. I'll probably use these when updating my big suite to a flow!

The scheduler host is the server or VM where your Cylc scheduler runs, to manage your workflow.

Job platforms are where the scheduler submits jobs to run.

The scheduler host - usually called localhost (because most things in Cylc are relative to the scheduler) - is often also a job platform (in fact it is the default job platform if you don't specify one), if you need to run task jobs locally on the scheduler host.

The global.cylc file configures how Cylc schedulers behave, including telling them what job platforms are available to submit jobs to - so it has to be readable by the scheduler on the scheduler host. You don't need a global.cylc file on the job platforms (if you run local jobs on the scheduler host the global.cylc file will of course happen to be there - but it's not used by the jobs).

That doesn't quite make sense.

  • BG_TASK is a job platform. A job platform does not "have" a scheduler host - that's not a property of job platforms. A job platform is just where a scheduler, running on a scheduler host, can submit jobs.
  • the job platform is not equivalent to the job runner, but a job platform has a job runner. E.g. on platform "HPC1" (say) you might have Slurm as the job runner. The "background" job runner is just how we tell Cylc to run jobs "in the background" - i.e. as a direct subprocess - rather than submitting them to a proper resource manager like Slurm or PBS.

Okay, it is becoming a bit clearer. Let me see if I understand. The scheduler host is essentially the machine that the Cylc scheduler will run on, and this is, by default, localhost. If the need arises, is this default changed in global.cylc[scheduler][run hosts]available? Is the scheduler host also where the global.cylc file needs to be?

Localhost is also a job platform defined by default, but you can add custom job platforms in global.cylc with different job submission settings (i.e. a different job runner).

The global.cylc needs to be on localhost because the scheduler host will look for it there (unless otherwise specified, by what I am guessing is global.cylc[scheduler][run hosts]available). I have always had the global.cylc - or the global.rc in Cylc 7 - on my login node without understanding why, but this clears that up.

A couple more questions regarding global.cylc[platforms]:

  • When you said "the scheduler host... is often also a job platform", are you referring to global.cylc[platforms]hosts?
  • As an example of how I run the suite: if I am running Cylc on the login node of an HPC and I want the suite to run by submitting via PBS to the compute nodes, what would be my scheduler host and my job platform?
  • Is global.cylc[platforms]install target used by Cylc to define where the job.err/.out files will go? Are those files what the documentation refers to as "remote file installation"?

Is there a way I can check to see if there are central definitions of job platforms?

Yes. Also known as the "scheduler run host", as per the global config "run hosts" settings.

It is "by default, localhost" in the sense that if you run cylc play on the command line, it will by default start the scheduler locally. If a pool of "run hosts" is configured, it will start the scheduler on one of those instead.
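For example (host names are placeholders), a pool of scheduler run hosts is configured like this:

# global.cylc
[scheduler]
    [[run hosts]]
        available = host1, host2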

The name "localhost" is also used in some configuration settings to refer to the host that the scheduler is running on (because those settings configure the scheduler program, and the host it is running on is "localhost" as far as it is concerned).

Yes.

Yes-ish. More specifically, the global.cylc file has to be on a filesystem that is visible from the scheduler run host. If you have a pool of run hosts, they all have to be on the same shared filesystem, so there's still only one central global.cylc file.

Well, you can put explicit platform settings for localhost in global.cylc if you like, but note those settings have to be valid for all scheduler run hosts (again, these settings are interpreted by schedulers, so "localhost" refers to the scheduler host).

However, my point there was really that we do often run jobs, as well as schedulers, on scheduler hosts. A single user installation on a laptop, for instance, likely runs everything on the same host (the laptop) and probably runs all jobs as simple background jobs (no PBS or Slurm).
Even on an HPC cluster, the default if you do not specify a platform in a task definition is to run the job as a local background job - and that does not require any global.cylc "localhost" platform settings.

The scheduler host (and therefore "localhost" in Cylc configs) is the login node. And you should define, in global.cylc, a job platform that specifies (a) pbs as the job runner; (b) localhost as the install target; and (c) hosts = localhost. The platform definition does not need to list compute nodes as hosts. It just lists the host(s) that Cylc can use to interact with the job runner - to submit, poll (query), or kill jobs.
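For example (the platform name is illustrative):

# global.cylc, on the HPC login node
[platforms]
    [[hpc_pbs]]
        job runner = pbs
        hosts = localhost
        install target = localhost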

If your system has multiple login nodes from which you can interact with PBS, and other "interactive nodes" you can use for e.g. manual postprocessing work, it would probably be better to run the Cylc scheduler there (on an interactive node) and define a job platform that lists the multiple login nodes as hosts - that's more robust as Cylc can continue to run and manage jobs even if one of the login nodes goes down.

No, an install target represents a filesystem, and is used to install workflow source files on job platforms. A platform may have multiple hosts, and multiple platforms might be on the same filesystem. Cylc only needs to install files once, for all the hosts that see the same filesystem.
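A sketch of that idea (the platform, host, and install target names are all illustrative) - two platforms on the same filesystem share one install target, so the workflow files are only installed once:

# global.cylc
[platforms]
    [[hpc_pbs]]
        job runner = pbs
        hosts = login01, login02
        install target = hpc
    [[hpc_background]]
        job runner = background
        hosts = login01, login02
        install target = hpc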

https://cylc.github.io/cylc-doc/stable/html/reference/config/writing-platform-configs.html#what-are-install-targets

Yes! Just type cylc config in your terminal. It parses and prints the global config by default (i.e., if you don't give it a specific workflow ID). It even has a special option to print just the platform definitions. See "Platform printing options" in cylc config --help.
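For example:

# list the names of all configured platforms (and platform groups)
cylc config --platform-names

# print the full platform (and platform group) configurations
cylc config --platforms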

Thank you, this is all very helpful information. I'll reply to this thread if I have further questions while I update my suites to Cylc 8 workflows.


From cylc config --help

  Platform printing options:
    --platform-names    Print a list of platforms and platform group names
                        from the configuration.
    --platforms         Print platform and platform group configurations,
                        including metadata.