Possible race condition in global.cylc reload

We are using Cylc 8 platforms and platform groups to make our workflows portable across different realms (user, test, prod). For example, we have the following platform groups for a project called kit in user space:

[platform groups]
    [[dm]]
        platforms = user-dm_kit_c
    [[xc]]
        platforms = user-xc_kit_c
    [[cs]]
        platforms = user-cs_kit_c
    [[ond-xc]]
        platforms = user-ond-xc_kit_c
    [[ond-cs]]
        platforms = user-ond-cs_kit_c
    [[ppn]]
        platforms = user-ppn_kit_c

In the test realm, these would instead look like:

[platform groups]
    [[dm]]
        platforms = test-dm_kit_c
    [[xc]]
        platforms = test-xc_kit_c
    [[cs]]
        platforms = test-cs_kit_c
    [[ond-xc]]
        platforms = test-ond-xc_kit_c
    [[ond-cs]]
        platforms = test-ond-cs_kit_c
    [[ppn]]
        platforms = test-ppn_kit_c

With these platform groups, workflow tasks only specify the type of node they want (e.g. dm or cs), and we generate the rest of the information in [platforms] with copious amounts of Jinja2. For example:

[platforms]
    [[user-ond-cs_kit_c]]
        hosts = HOSTNAME
        job runner = pbs
        install target = user:c:kit
        retrieve job logs = True
        retrieve job logs retry delays = PT1M, PT1M, PT1M, PT10M, PT10M
        global init-script = """
        ...
        """
        [[[directives]]]
            -P = kit
            -q = QUEUE
            -W umask = 0022
        [[[meta]]]
            realm = user
            queue = QUEUE
            project = kit
            disk = c
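
For context, on the workflow side a task then only has to name the group; a minimal sketch (the task name here is illustrative):

[runtime]
    [[build_model]]
        platform = cs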

Sometimes (usually after a workflow has already run more than one successful cycle) jobs fail with:

[jobs-submit cmd] (platform not defined)
[jobs-submit ret_code] 1
[jobs-submit err] No matching platform "user-dm_kit_c" found

or:

[jobs-submit cmd] (platform not defined)
[jobs-submit ret_code] 1
[jobs-submit err] Unable to find a platform from group cs.

This is happening intermittently across all workflows and users, and it definitely happens in tasks that differ only by cycle point from other successful runs. At a guess, when the global.cylc file is reloaded every 10 minutes to re-assess the available/condemned hosts, there is a race condition between processing the Jinja2-generated platforms and attempting to use the not-yet-processed platforms. Re-triggering failed jobs does not always fix them.

Suggestions on how to confirm this is a race condition, or how to avoid this issue would be appreciated.

Hi @jarich

That does seem like a bug, if it’s intermittent as you say. The UK team will be best placed to respond to this one, I think. It was a bank holiday there on Monday, but they’ll be back online soon. If it’s not clear from your description how this could happen, we might need to try to reproduce your setup somehow.

We’ve not seen this before.

Note that whilst Cylc does reload the global config to check for condemned hosts, this shouldn’t affect platform definitions, which should remain stable (we haven’t knowingly implemented refreshing of platform definitions or other global config values, yet). This error appears to be coming from the job-submission script, which is run as a subprocess outside of the Cylc scheduler but may also need to load the global configuration. Could it be that the Jinja2 logic that generates the global.cylc file isn’t stable in the subprocess environment? What happens if you re-trigger the task? Does it eventually succeed, or does it always fail?

Re-runs eventually succeed.

If there are any patches which would be useful to apply to our Cylc installs to add some extra logging in the event of this failure, please let us know what to add and we can patch it in.

I assume the output you’ve shown is from job-activity.log?
The first time you get a submit failure, what do you see in scheduler/log?

OK, if it eventually succeeds, then this suggests that the global config can sometimes load differently in different scheduler subprocesses. One possibility is that differences in the subprocess environment are causing it. Presumably there’s some check in the global config, dependent on the environment, that determines which version you get?
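
For instance, if the realm switch looks anything like this hypothetical sketch (the REALM variable and its fallback are purely illustrative, not taken from your config), then a variable that is missing or different in the subprocess environment would generate a different set of platforms:

#!Jinja2

{# Hypothetical realm selection driven by an environment variable #}
{% set realm = environ.get('REALM', 'user') %}
[platform groups]
    [[dm]]
        platforms = {{ realm }}-dm_kit_c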

add some extra logging in the event of this failure, please let us know what to add and we can patch it in.

You can log the environment that the global config is loaded in like this:

#!Jinja2

{% from "cylc.flow" import LOG %}
{% do LOG.info(environ) %}

https://cylc.github.io/cylc-doc/stable/html/user-guide/writing-workflows/jinja2.html#logging

You might want to filter that to pertinent variables.
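
For example, here is a minimal sketch that logs just a handful of variables (the names in the list are illustrative; use whichever variables your Jinja2 actually branches on):

#!Jinja2

{% from "cylc.flow" import LOG %}
{# Log only the variables suspected of affecting platform generation #}
{% for name in ['USER', 'HOME', 'REALM'] %}
{% do LOG.info(name ~ '=' ~ environ.get(name, '<unset>')) %}
{% endfor %}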

An update: we are now thinking it might be network blips causing hosts to be condemned, such that a platform no longer has any hosts available because they have all been condemned.

For example, we may see something like this error message in the logs:

2024-05-13T23:57:27Z WARNING - platform: user-my_platform - Could not connect to hpc.remote.address.alias.
    * hpc.remote.address.alias has been added to the list of unreachable hosts
    * job-logs-retrieve will retry if another host is available.

Then later, something like this (ignore the times; these are from different logs):

2024-05-14T00:25:14Z ERROR - [jobs-submit cmd] (platform not defined)
    [jobs-submit ret_code] 1
    [jobs-submit err] Unable to find a platform from group platform_group1.

In the global.cylc file, we have something like:

[platforms]
    [[user-my_platform]]
        hosts = hpc.remote.address.alias

[platform groups]
    [[platform_group1]]
        platforms = user-my_platform

So it seems like a network blip that resolves itself within a second is condemning the hosts; the platform is then no longer usable, and submission fails.

Does this sound likely/possible? If so, I think the error message in the log isn’t ideal, as it would ideally say “No usable hosts found for platform ‘user-my_platform’ as they are all condemned” or something like that.

I’ve not looked at the logs, but is it possible to either

  1. Require host issues to occur over a couple of minutes before condemning
  2. Not allow a host to become condemned

The term “condemned” is specific to scheduler hosts, not job hosts. That’s how we tell Cylc schedulers to move off a Cylc VM that (e.g.) needs to be taken down for maintenance.

For job platforms, if Cylc can’t reach a platform host, the unreachable host will be taken off the list for a while until the list resets, and another host tried instead. Similarly for platform groups (I think).

If that’s what’s happening, then I agree that the error message “(platform not defined)” is misleading. The follow-up “Unable to find a platform from group platform_group1” is better, but we should probably have a special alert for running out of reachable platforms entirely. (Although I think that should result in an immediate list reset … we might need to check that for platform groups).

To your specific questions, ignoring the incorrect terminology:

  1. not at present, but I’m not sure it’s sensible to keep trying the same unreachable host?
  2. ditto

If you switch to using a platform rather than a platform group it should work fine.
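
In other words (roughly; the task name here is illustrative), point tasks directly at the platform rather than the group:

[runtime]
    [[my_task]]
        platform = user-my_platform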

The bad hosts list is cleared if a task cannot be submitted because all of the hosts it might use cannot be reached - see https://cylc.github.io/cylc-doc/stable/html/plugins/main-loop/built-in/cylc.flow.main_loop.reset_bad_hosts.html#module-cylc.flow.main_loop.reset_bad_hosts
This is not currently happening for platform groups - we’ll need to investigate this.

Thanks @hilary.j.oliver and @dpmatthews for the information.

A few small (maybe) extra questions. How do the host lists work in a platform?

Does it always just go to the first item in the host list and, if that’s unreachable, the second, etc.? Or does it pick a random host and use that?

Also, if there are multiple hosts and one is marked as unreachable, how often does Cylc re-try that one to check it again?

The selection method is configurable (random by default): https://cylc.github.io/cylc-doc/stable/html/reference/config/global.html#global.cylc[platforms][%3Cplatform%20name%3E][selection]method

The bad hosts list is reset every 30 min (by default): https://cylc.github.io/cylc-doc/stable/html/reference/config/global.html#global.cylc[scheduler][main%20loop][reset%20bad%20hosts]interval
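
Both live in global.cylc; a minimal sketch showing the defaults (the platform and host names are illustrative):

[scheduler]
    [[main loop]]
        [[[reset bad hosts]]]
            interval = PT30M

[platforms]
    [[user-my_platform]]
        hosts = host1, host2
        [[[selection]]]
            method = random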

If all hosts of a platform fail, we remove those hosts from the stored set of uncontactable hosts to allow for submission retries.

If a single host goes down for a bit, but other hosts within the platform remain available (or Cylc doesn’t attempt contact during the comms blip), then the unreachable host will be reset by the main loop plugin, as mentioned by @dpmatthews.

It looks like we might not be doing that in the case of platform groups, which is probably a bug.