We are using Cylc 8 platforms and platform groups to make our workflows portable across different realms (user, test, prod). For example, we have the following platform groups for a project called kit in user space:
[platform groups]
[[dm]]
platforms = user-dm_kit_c
[[xc]]
platforms = user-xc_kit_c
[[cs]]
platforms = user-cs_kit_c
[[ond-xc]]
platforms = user-ond-xc_kit_c
[[ond-cs]]
platforms = user-ond-cs_kit_c
[[ppn]]
platforms = user-ppn_kit_c
In the test realm, these would instead look like:
[platform groups]
[[dm]]
platforms = test-dm_kit_c
[[xc]]
platforms = test-xc_kit_c
[[cs]]
platforms = test-cs_kit_c
[[ond-xc]]
platforms = test-ond-xc_kit_c
[[ond-cs]]
platforms = test-ond-cs_kit_c
[[ppn]]
platforms = test-ppn_kit_c
With these platform groups, workflow tasks only specify the type of node they want (e.g. dm or cs), and we generate the rest of the information in [platforms] with copious amounts of Jinja2. For example:
[platforms]
[[user-ond-cs_kit_c]]
hosts = HOSTNAME
job runner = pbs
install target = user:c:kit
retrieve job logs = True
retrieve job logs retry delays = PT1M, PT1M, PT1M, PT10M, PT10M
global init-script = """
...
"""
[[[directives]]]
-P = kit
-q = QUEUE
-W umask = 0022
[[[meta]]]
realm = user
queue = QUEUE
project = kit
disk = c
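As an illustration of the naming pattern only (this is not the actual Jinja2 template), the platform names above look like they are composed as realm-node_project_disk. The sketch below reproduces that pattern in plain Python standing in for Jinja2. The function names platform_name and render_platform_groups, and the NODE_TYPES list, are hypothetical and exist only for this example.

```python
# Illustration only: plain Python standing in for the Jinja2 that generates
# the real global.cylc. All names here are hypothetical.

# Node types taken from the platform groups listed above.
NODE_TYPES = ["dm", "xc", "cs", "ond-xc", "ond-cs", "ppn"]


def platform_name(realm: str, node: str, project: str, disk: str) -> str:
    """Compose a platform name such as 'user-dm_kit_c'."""
    return f"{realm}-{node}_{project}_{disk}"


def render_platform_groups(realm: str, project: str, disk: str,
                           node_types=NODE_TYPES) -> str:
    """Render a [platform groups] section with one group per node type."""
    lines = ["[platform groups]"]
    for node in node_types:
        lines.append(f"[[{node}]]")
        lines.append(
            f"platforms = {platform_name(realm, node, project, disk)}"
        )
    return "\n".join(lines)
```

Calling render_platform_groups("test", "kit", "c") yields the test-realm groups shown earlier, so only the realm argument changes between user, test and prod while the group names that tasks refer to stay the same.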
Sometimes (usually after a workflow has already run more than one successful cycle) jobs fail with:
[jobs-submit cmd] (platform not defined)
[jobs-submit ret_code] 1
[jobs-submit err] No matching platform "user-dm_kit_c" found
or:
[jobs-submit cmd] (platform not defined)
[jobs-submit ret_code] 1
[jobs-submit err] Unable to find a platform from group cs.
This happens intermittently across all workflows and users, and it definitely affects tasks that differ only by cycle point from other runs that succeeded. At a guess, when the global.cylc file is reloaded every 10 minutes to re-assess the available/condemned hosts, there is a race condition between processing the Jinja2-generated platforms and attempting to use platforms that have not yet been processed. Re-triggering failed jobs does not always fix them.
Any suggestions on how to confirm that this is a race condition, or on how to avoid the issue, would be appreciated.
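One possible way to narrow this down, sketched below under several assumptions: poll the global configuration from outside the scheduler and record any moment at which an expected platform or platform group is missing. The sketch assumes that `cylc config --platform-names` prints one platform or group name per line, and the EXPECTED list, interval and function names are placeholders. It only tests whether the configuration as parsed from disk is ever incomplete. If it never is, the problem is more likely inside the scheduler's own reload than in the file itself.

```python
import subprocess
import time
from datetime import datetime, timezone

# Placeholder names taken from the error messages above; adjust as needed.
EXPECTED = ["dm", "cs", "user-dm_kit_c", "user-cs_kit_c"]


def missing_names(config_output: str, expected) -> list:
    """Return the expected names that do not appear as a whole line."""
    present = {line.strip() for line in config_output.splitlines()}
    return [name for name in expected if name not in present]


def list_platform_names() -> str:
    """Run the Cylc CLI and return everything it printed.

    Assumption: `cylc config --platform-names` lists one name per line.
    """
    result = subprocess.run(
        ["cylc", "config", "--platform-names"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout + result.stderr


def poll(get_output=list_platform_names, expected=EXPECTED,
         interval=5.0, checks=10, sleep=time.sleep) -> list:
    """Check the config `checks` times; return (timestamp, missing) pairs."""
    failures = []
    for i in range(checks):
        missing = missing_names(get_output(), expected)
        if missing:
            stamp = datetime.now(timezone.utc).isoformat()
            failures.append((stamp, missing))
        if i < checks - 1:
            sleep(interval)
    return failures
```

Running poll() across a few of the 10-minute reload windows and comparing the recorded timestamps with the times of the failed job submissions would show whether the two line up. An empty result does not rule out a race inside the scheduler.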