Reset bad hosts list from CLI?

I sometimes have jobs stall due to expired authentication credentials. I then see something like the following in my cylc log output:

2023-09-05T23:49:50Z WARNING - platform: None - Could not connect to mustang.afrl.hpc.mil.
    * mustang.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-poll will retry if another host is available.

I can renew the authentication credentials to fix the original problem, but it seems that cylc still regards the host as inaccessible and doesn’t attempt any operations that require connecting to the remote host. For instance, manually triggering tasks fails immediately with a submit-failed result. Eventually the flow stalls and exits.

Is there some way to remove the offending host from the “list of unreachable hosts” mentioned in the log? Looking at the documentation, I see a function cylc.flow.main_loop.reset_bad_hosts that looks like it might do the trick. Would I be correct in surmising that calling this function would enable cylc to attempt connections to the failed host again? If so, is there a way to invoke this function from the CLI?

Yes, that’s right.

https://cylc.github.io/cylc-doc/stable/html/plugins/main-loop/built-in/cylc.flow.main_loop.reset_bad_hosts.html

Main loop plugins are loaded at start-up, and configured in global.cylc.

The reset_bad_hosts plugin should be loaded by default, with a reset interval of 30 minutes (configurable).

To check:

$ cylc config --defaults | grep -A 5 -B 5 'reset bad hosts'
        smtp =
        to =
        footer =
        task event batch interval = PT5M
    [[main loop]]
        plugins = health check, reset bad hosts  # <----
        [[[health check]]]
            interval = PT10M
        [[[auto restart]]]
            interval = PT10M
        [[[reset bad hosts]]]  # <----
            interval = PT30M
    [[logging]]
        rolling archive length = 15
        maximum size in bytes = 1000000
[install]

So, this should be working already. Maybe the reset interval is too long for you?

You could check that it’s working by changing the interval to, say, PT1M (1 minute), deliberately disabling your host credentials, and retriggering a task a few times.

[We have not implemented a CLI command to trigger a reset]

Another way to test this is to define a platform with a short list of non-existent hosts, and repeatedly trigger a task (assigned to that platform) in a simple test workflow. If you want to try it, see cylc/cylc-flow issue #5719 on GitHub, “Platforms bug: bad hosts (mainly logging)”, for details.

My tests confirm that the bad hosts clearing main loop plugin works as advertised, but the logging of bad hosts processing leaves something to be desired (we’ll fix it).

A CLI command to trigger a reset isn’t needed, because simply retriggering a task will clear the list of bad hosts and start again if there are no good hosts left.

I checked the output from cylc config --defaults as you suggested and it does indeed show the reset bad hosts plugin loaded with an interval of 30 minutes. But…

With that being the case, I probably don’t need to adjust the interval since I can just manually trigger any failed submissions after renewing my credentials. And I can confirm manual triggering is working for me now after destroying and renewing my host credentials (it seemed not to be earlier, but perhaps that was due to a quirk in my environment).

Thanks so much for your help with this!

It may have seemed that way if you didn’t try one more time - as per my GitHub issue link above, currently there does seem to be one unnecessary job submission failure before the bad hosts list gets cleared.

Returning to this, as the behaviour reported in the GitHub issue linked above may have been due to me working in a screwed-up development branch at the time. What is supposed to happen, and what I’ve now confirmed does happen, is that Cylc automatically tries the next host in the platform list if it can’t connect to one, until the list is exhausted. At that point the workflow will stall (unless there is ongoing activity elsewhere in the graph), but if you manually retrigger the task, that will reset the bad hosts list and it will try all the hosts again.

Here’s the scheduler output showing that, for my fake-hosts example above:

$ cylc vip --no-detach --no-timestamp
# cylc validate /home/oliverh/cylc-src/one
Valid for cylc-8.2.1
# cylc install /home/oliverh/cylc-src/one
INSTALLED one/run8 from /home/oliverh/cylc-src/one
# cylc play --no-detach --no-timestamp one/run8

 ▪ ■  Cylc Workflow Engine 8.2.1
 ██   Copyright (C) 2008-2023 NIWA
▝▘    & British Crown (Met Office) & Contributors

INFO - Extracting job.sh to /home/oliverh/cylc-run/one/run8/.service/etc/job.sh
INFO - Workflow: one/run8
...
INFO - New flow: 1 (original flow from 1) 2023-09-06 22:12:49
INFO - [1/foo waiting(runahead) job:00 flows:1] => waiting
INFO - [1/foo waiting job:00 flows:1] => waiting(queued)
INFO - [1/foo waiting(queued) job:00 flows:1] => waiting
INFO - [1/foo waiting job:01 flows:1] => preparing
WARNING - platform: None - Could not connect to zippo.  # <---- 
    * zippo has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
WARNING - platform: None - Could not connect to nada.  # <----
    * nada has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
WARNING - platform: None - Could not connect to nein.  # <----
    * nein has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1
CRITICAL - [1/foo preparing job:01 flows:1] submission failed
INFO - [1/foo preparing job:01 flows:1] => submit-failed
CRITICAL - platform: fake - initialisation did not complete (no hosts were
    reachable)
ERROR - Incomplete tasks:
      * 1/foo did not complete required outputs: ['succeeded']
CRITICAL - Workflow stalled
WARNING - PT1H stall timer starts NOW

Then, it will try all 3 hosts again if I do cylc trigger one//1/foo.
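
For anyone who finds the behaviour easier to follow as code, here is a toy model of it in Python. It is only a sketch of what is described in this post, not Cylc's actual implementation: the function names `submit` and `trigger`, and the `reachable` set standing in for "hosts I can currently authenticate to", are made up for illustration.

```python
import random


def submit(hosts, bad_hosts, reachable, rng):
    """Try hosts in random order until one connects or none are left.

    Returns (host_or_None, hosts_tried). None means "submit-failed".
    """
    tried = []
    while True:
        candidates = [h for h in hosts if h not in bad_hosts]
        if not candidates:
            return None, tried  # host list exhausted
        host = rng.choice(candidates)
        tried.append(host)
        if host in reachable:
            return host, tried
        bad_hosts.add(host)  # remembered in memory only


def trigger(hosts, bad_hosts, reachable, rng):
    """Manual retrigger: start afresh if no good hosts are left."""
    if all(h in bad_hosts for h in hosts):
        bad_hosts.clear()
    return submit(hosts, bad_hosts, reachable, rng)


hosts = ["zippo", "nada", "nein"]
bad_hosts = set()
rng = random.Random(0)

# First submission: nothing is reachable, so every host is tried once.
host, tried = submit(hosts, bad_hosts, set(), rng)
print("first attempt:", host, tried)

# Retrigger while still unreachable: the list is cleared and all are tried again.
host2, tried2 = trigger(hosts, bad_hosts, set(), rng)
print("retrigger:", host2, tried2)

# Retrigger after "nada" becomes reachable (e.g. credentials renewed).
host3, tried3 = trigger(hosts, bad_hosts, {"nada"}, rng)
print("after fix:", host3, tried3)
```

The order in which hosts are tried varies with the random seed, which matches the shuffled order visible in the logs.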

Is that what you see, or does it not automatically retry until the host list is exhausted?

I only had one host configured in the flow I was testing with originally. I’ll try running your fake-hosts example and see what I get with that. I can also change my platform setup to use more hosts (the machine I’m using has multiple login nodes, I just didn’t list them all in my cylc config).

Thanks, good to know - that totally explains why you were not seeing automatic retries for other available hosts.

Finally got around to testing this myself. Thank you for all your work helping me understand and test this. I do see automatic retries on the fake hosts test:

2023-09-11T09:31:01-04:00 WARNING - platform: None - Could not connect to zippo.
    * zippo has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T09:31:02-04:00 WARNING - platform: None - Could not connect to nein.
    * nein has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T09:31:03-04:00 WARNING - platform: None - Could not connect to nada.
    * nada has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T09:31:03-04:00 ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1
2023-09-11T09:31:03-04:00 CRITICAL - [1/foo preparing job:01 flows:1] submission failed
2023-09-11T09:31:03-04:00 INFO - [1/foo preparing job:01 flows:1] => submit-failed
2023-09-11T09:31:03-04:00 CRITICAL - platform: fake - initialisation did not complete (no hosts were reachable)
2023-09-11T09:31:03-04:00 ERROR - Incomplete tasks:
      * 1/foo did not complete required outputs: ['succeeded']
2023-09-11T09:31:03-04:00 CRITICAL - Workflow stalled
2023-09-11T09:31:03-04:00 WARNING - PT1H stall timer starts NOW

After force triggering, all 3 hosts are tried again:

2023-09-11T09:35:01-04:00 INFO - [1/foo waiting job:02 flows:1] => preparing
2023-09-11T09:35:01-04:00 WARNING - stall timer stopped
2023-09-11T09:35:02-04:00 WARNING - platform: None - Could not connect to nein.
    * nein has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T09:35:03-04:00 WARNING - platform: None - Could not connect to nada.
    * nada has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T09:35:04-04:00 WARNING - platform: None - Could not connect to zippo.
    * zippo has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T09:35:04-04:00 ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1
2023-09-11T09:35:04-04:00 CRITICAL - [1/foo preparing job:02 flows:1] submission failed
2023-09-11T09:35:04-04:00 INFO - [1/foo preparing job:02 flows:1] => submit-failed
2023-09-11T09:35:04-04:00 CRITICAL - platform: fake - initialisation did not complete (no hosts were reachable)
2023-09-11T09:35:04-04:00 ERROR - Incomplete tasks:
      * 1/foo did not complete required outputs: ['succeeded']
2023-09-11T09:35:04-04:00 CRITICAL - Workflow stalled
2023-09-11T09:35:04-04:00 WARNING - PT1H stall timer starts NOW

When I try on my original flow, but with more hosts (14) added to the setup, I see 4 automatic retries each time, with hosts apparently selected at random:

2023-09-11T13:03:24Z INFO - Command actioned: force_trigger_tasks(['20150107T1620Z/compute_predictions'], flow=['all'], flow_wait=False, flow_descr=None)
2023-09-11T13:03:24Z INFO - [20150107T1620Z/compute_predictions waiting job:02 flows:1] => preparing
2023-09-11T13:03:24Z WARNING - stall timer stopped
2023-09-11T13:03:25Z WARNING - platform: None - Could not connect to [mustang.afrl.hpc.mil.
    * [mustang.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T13:03:26Z WARNING - platform: None - Could not connect to mustang04.afrl.hpc.mil.
    * mustang04.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T13:03:27Z WARNING - platform: None - Could not connect to mustang02.afrl.hpc.mil.
    * mustang02.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T13:03:47Z WARNING - platform: None - Could not connect to mustang08.afrl.hpc.mil.
    * mustang08.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T13:03:47Z CRITICAL - [20150107T1620Z/compute_predictions preparing job:02 flows:1] submission failed
2023-09-11T13:03:47Z INFO - [20150107T1620Z/compute_predictions preparing job:02 flows:1] => submit-failed
2023-09-11T13:03:47Z ERROR - Incomplete tasks:
      * 20150107T1620Z/compute_predictions did not complete required outputs: ['succeeded']
2023-09-11T13:03:47Z WARNING - Partially satisfied prerequisites:
      * 20150108T1200Z/compute_predictions is waiting on ['20150108T1155Z/advance_member11:succeeded', '20150108T1155Z/advance_member10:succeeded', '20150108T1155Z/advance_member05:succeeded', '20150108T1155Z/advance_member04:succeeded', '20150108T1155Z/advance_member02:succeeded', '20150108T1155Z/advance_member06:succeeded', '20150108T1155Z/advance_member03:succeeded', '20150108T1155Z/advance_member08:succeeded', '20150108T1155Z/advance_member01:succeeded', '20150108T1155Z/advance_member15:succeeded', '20150108T1155Z/advance_member12:succeeded', '20150108T1155Z/advance_member09:succeeded', '20150108T1155Z/advance_member07:succeeded', '20150108T1155Z/advance_member13:succeeded', '20150108T1155Z/advance_member14:succeeded']
2023-09-11T13:03:47Z CRITICAL - Workflow stalled
2023-09-11T13:03:47Z WARNING - PT1H stall timer starts NOW

So apparently there is a limit on the number of retries, but I do see automatic retries as expected.

I noticed another quirk in testing this, which I’m not sure how to troubleshoot but which doesn’t seem like expected behavior. As I said above, if a flow stalls because of failed submissions after I manually destroy my login credentials, then I’m able to get it un-stalled by manually triggering tasks, as expected. However, that doesn’t work if the flow stalls due to the login credentials expiring (i.e. without destroying the credentials manually). In the case of expired credentials manual triggering doesn’t work, even after renewing my login credentials. I can get it running again by stopping and re-starting the scheduler, but from what you said above this doesn’t seem like expected behavior.

No, there is no limit. It should exhaust the list of hosts before the job goes to the submit-failed state.

I’ve just retested with an extended list of fake hosts, and it ran through them all. Cylc doesn’t know if a host is “fake” or just unreachable, so that shouldn’t matter.

Are you sure that your run had picked up the global config change to add more hosts?

Hmmm that’s odd. Could that be down to the difference between “destroyed” (deleted?) and “expired” credentials? In the former case, perhaps the connection is rejected earlier in the process. In the latter, what happens exactly? Do you see any messages in the scheduler log, or in the job activity log?

Yes, it definitely did because the newly added hosts show up in the log messages (mustang02.afrl.hpc.mil, mustang04.afrl.hpc.mil, and mustang08.afrl.hpc.mil are all new hosts that weren’t in the platform config before). So the only reason for it not to try all of them is if some of the hosts were already on the bad hosts list? I had assumed the bad hosts list was only stored in memory and destroyed when the scheduler exited, but the log output I quoted above that listed only 4 hosts was the first time the bad hosts list was mentioned in the output from that particular scheduler run.

It seems like it could be, but I’m not sure what the mechanism would be. I assume the scheduler doesn’t know or care how the authentication is handled, which means, as you say, it’s probably down to the timing of when the failure occurs, or some other aspect that the scheduler actually knows about. Or the behavior is due to a quirk of how the authentication system works. The system in question uses Kerberos as the authentication mechanism, but I’m not totally sure how the Kerberos credentials cache is stored. So it might be that there’s some mechanism that restricts which system processes can access the credentials cache. In which case there’s nothing cylc can do about it, and all I can do is restart the scheduler to recover.

When my credentials expire I get log messages that look more or less the same as what I see when I destroy them:

2023-09-11T23:17:35Z WARNING - platform: None - Could not connect to mustang12.afrl.hpc.mil.
    * mustang12.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:36Z WARNING - platform: None - Could not connect to mustang11.afrl.hpc.mil.
    * mustang11.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:37Z WARNING - platform: None - Could not connect to mustang06.afrl.hpc.mil.
    * mustang06.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:38Z WARNING - platform: None - Could not connect to mustang02.afrl.hpc.mil.
    * mustang02.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:39Z WARNING - platform: None - Could not connect to mustang07.afrl.hpc.mil.
    * mustang07.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:40Z WARNING - platform: None - Could not connect to mustang05.afrl.hpc.mil.
    * mustang05.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:41Z WARNING - platform: None - Could not connect to mustang08.afrl.hpc.mil.
    * mustang08.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:42Z WARNING - platform: None - Could not connect to mustang09.afrl.hpc.mil.
    * mustang09.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:43Z WARNING - platform: None - Could not connect to mustang13.afrl.hpc.mil.
    * mustang13.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:44Z WARNING - platform: None - Could not connect to mustang14.afrl.hpc.mil].
    * mustang14.afrl.hpc.mil] has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:45Z WARNING - platform: None - Could not connect to [mustang.afrl.hpc.mil.
    * [mustang.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:46Z WARNING - platform: None - Could not connect to mustang04.afrl.hpc.mil.
    * mustang04.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:47Z WARNING - platform: None - Could not connect to mustang10.afrl.hpc.mil.
    * mustang10.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:48Z WARNING - platform: None - Could not connect to mustang01.afrl.hpc.mil.
    * mustang01.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:49Z WARNING - platform: None - Could not connect to mustang03.afrl.hpc.mil.
    * mustang03.afrl.hpc.mil has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T23:17:49Z ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1
2023-09-11T23:17:49Z CRITICAL - [20150107T1645Z/compute_predictions preparing job:01 flows:1] submission failed
2023-09-11T23:17:49Z INFO - [20150107T1645Z/compute_predictions preparing job:01 flows:1] => submit-failed
2023-09-11T23:17:49Z CRITICAL - platform: mustang - initialisation did not complete (no hosts were reachable)
2023-09-11T23:17:49Z ERROR - Incomplete tasks:
      * 20150107T1645Z/compute_predictions did not complete required outputs: ['succeeded']
2023-09-11T23:17:49Z WARNING - Partially satisfied prerequisites:
      * 20150108T1200Z/compute_predictions is waiting on ['20150108T1155Z/advance_member04:succeeded', '20150108T1155Z/advance_member02:succeeded', '20150108T1155Z/advance_member12:succeeded', '20150108T1155Z/advance_member07:succeeded', '20150108T1155Z/advance_member09:succeeded', '20150108T1155Z/advance_member11:succeeded', '20150108T1155Z/advance_member08:succeeded', '20150108T1155Z/advance_member13:succeeded', '20150108T1155Z/advance_member05:succeeded', '20150108T1155Z/advance_member14:succeeded', '20150108T1155Z/advance_member10:succeeded', '20150108T1155Z/advance_member03:succeeded', '20150108T1155Z/advance_member15:succeeded', '20150108T1155Z/advance_member06:succeeded', '20150108T1155Z/advance_member01:succeeded']
2023-09-11T23:17:49Z CRITICAL - Workflow stalled
2023-09-11T23:17:49Z WARNING - PT1H stall timer starts NOW
2023-09-12T00:17:49Z WARNING - stall timer timed out after PT1H
2023-09-12T00:17:49Z ERROR - Workflow shutting down - "abort on stall timeout" is set
2023-09-12T00:17:49Z INFO - DONE

Note this time it actually did try all the hosts (15 of them) before giving up. So whatever happened before to cause the scheduler to only try 4 hosts must have been specific to that run.
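
In case it helps anyone else count how many hosts were tried before each failure, here is a small Python sketch that groups the "unreachable hosts" messages by submit-failed event. It only assumes the log line wording shown in this thread; the function name `hosts_tried_per_failure` is my own, and the grouping is naive (it would mix things up if several tasks were failing at the same time).

```python
import re

# Bullet line logged for each host that could not be reached.
UNREACHABLE = re.compile(
    r"^\s*\*\s+(\S+) has been added to the list of unreachable hosts"
)
# State change logged when a job submission finally fails.
SUBMIT_FAILED = re.compile(r"=> submit-failed")


def hosts_tried_per_failure(log_text):
    """Return one list of unreachable hosts per submit-failed event."""
    attempts, current = [], []
    for line in log_text.splitlines():
        match = UNREACHABLE.match(line)
        if match:
            current.append(match.group(1))
        elif SUBMIT_FAILED.search(line):
            attempts.append(current)
            current = []
    return attempts


# Excerpt from the fake-hosts test log earlier in this thread.
sample = """\
2023-09-11T09:31:01-04:00 WARNING - platform: None - Could not connect to zippo.
    * zippo has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T09:31:02-04:00 WARNING - platform: None - Could not connect to nein.
    * nein has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T09:31:03-04:00 WARNING - platform: None - Could not connect to nada.
    * nada has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2023-09-11T09:31:03-04:00 CRITICAL - [1/foo preparing job:01 flows:1] submission failed
2023-09-11T09:31:03-04:00 INFO - [1/foo preparing job:01 flows:1] => submit-failed
"""

for number, tried_hosts in enumerate(hosts_tried_per_failure(sample), start=1):
    print(f"failure {number}: {len(tried_hosts)} hosts tried: {tried_hosts}")
```

Host names are kept verbatim, so stray characters such as the `[` and `]` visible in some of my log lines would show up in the output too.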

Some general points:

  • Platform hosts lists don’t have a numerical limit.
  • Bad hosts are stored in memory attached to the scheduler and do not persist beyond the scheduler.
  • Host selection is random. You can make it definition order, but that’s only going to be useful if you have a preferred and a backup host (and this scenario will often be better described as two platforms in a platform group).

One possible scenario that might have led to only 4 hosts being tried: if a previous task in the workflow failed and had no retries left, the bad hosts list would not have been cleared[1]. If you then reloaded, the new task might only try the new hosts. Judging by the mustangX numbers, this doesn’t look to be the case here.

[1] Except by periodic clearance.
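
To illustrate the "in memory" and "periodic clearance" points, here is a toy Python model. It is a sketch under stated assumptions, not the code of the reset bad hosts plugin: the class `BadHostList` and its `tick` method are hypothetical, and the 30 minute interval is simply the PT30M default shown earlier in the thread.

```python
from datetime import datetime, timedelta


class BadHostList:
    """Toy in-memory list of unreachable hosts with periodic clearance."""

    def __init__(self, interval, now):
        self.interval = interval
        self.hosts = set()
        self.last_reset = now

    def add(self, host):
        self.hosts.add(host)

    def tick(self, now):
        """Called on each pass of a (hypothetical) scheduler main loop."""
        if now - self.last_reset >= self.interval:
            self.hosts.clear()
            self.last_reset = now


start = datetime(2023, 9, 5, 23, 0)
bad = BadHostList(timedelta(minutes=30), start)
bad.add("mustang.afrl.hpc.mil")

# 29 minutes in: the interval has not elapsed, so the host is still listed.
bad.tick(start + timedelta(minutes=29))
still_bad = set(bad.hosts)

# 30 minutes in: the list is cleared and the host can be tried again.
bad.tick(start + timedelta(minutes=30))
after_reset = set(bad.hosts)

print(still_bad, after_reset)
```

Because the set lives only in the object, it disappears when the process exits, which is the sense in which bad hosts do not persist beyond the scheduler.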