Finally got around to testing this myself. Thank you for all your work helping me understand and test this. I do see automatic retries on the fake hosts test:
2023-09-11T09:31:01-04:00 WARNING - platform: None - Could not connect to zippo.
* zippo has been added to the list of unreachable hosts
* jobs-submit will retry if another host is available.
2023-09-11T09:31:02-04:00 WARNING - platform: None - Could not connect to nein.
* nein has been added to the list of unreachable hosts
* jobs-submit will retry if another host is available.
2023-09-11T09:31:03-04:00 WARNING - platform: None - Could not connect to nada.
* nada has been added to the list of unreachable hosts
* jobs-submit will retry if another host is available.
2023-09-11T09:31:03-04:00 ERROR - [jobs-submit cmd] (remote init)
[jobs-submit ret_code] 1
2023-09-11T09:31:03-04:00 CRITICAL - [1/foo preparing job:01 flows:1] submission failed
2023-09-11T09:31:03-04:00 INFO - [1/foo preparing job:01 flows:1] => submit-failed
2023-09-11T09:31:03-04:00 CRITICAL - platform: fake - initialisation did not complete (no hosts were reachable)
2023-09-11T09:31:03-04:00 ERROR - Incomplete tasks:
* 1/foo did not complete required outputs: ['succeeded']
2023-09-11T09:31:03-04:00 CRITICAL - Workflow stalled
2023-09-11T09:31:03-04:00 WARNING - PT1H stall timer starts NOW
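For reference, the "fake" platform in that test was defined with deliberately unresolvable hosts; the definition was something like the following in global.cylc (the platform and host names come from the log above, the rest is just minimal assumed boilerplate):

[platforms]
    [[fake]]
        # hosts that will never resolve, so every connection attempt fails
        hosts = zippo, nein, nada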
After force triggering, all 3 hosts are tried again:
2023-09-11T09:35:01-04:00 INFO - [1/foo waiting job:02 flows:1] => preparing
2023-09-11T09:35:01-04:00 WARNING - stall timer stopped
2023-09-11T09:35:02-04:00 WARNING - platform: None - Could not connect to nein.
* nein has been added to the list of unreachable hosts
* jobs-submit will retry if another host is available.
2023-09-11T09:35:03-04:00 WARNING - platform: None - Could not connect to nada.
* nada has been added to the list of unreachable hosts
* jobs-submit will retry if another host is available.
2023-09-11T09:35:04-04:00 WARNING - platform: None - Could not connect to zippo.
* zippo has been added to the list of unreachable hosts
* jobs-submit will retry if another host is available.
2023-09-11T09:35:04-04:00 ERROR - [jobs-submit cmd] (remote init)
[jobs-submit ret_code] 1
2023-09-11T09:35:04-04:00 CRITICAL - [1/foo preparing job:02 flows:1] submission failed
2023-09-11T09:35:04-04:00 INFO - [1/foo preparing job:02 flows:1] => submit-failed
2023-09-11T09:35:04-04:00 CRITICAL - platform: fake - initialisation did not complete (no hosts were reachable)
2023-09-11T09:35:04-04:00 ERROR - Incomplete tasks:
* 1/foo did not complete required outputs: ['succeeded']
2023-09-11T09:35:04-04:00 CRITICAL - Workflow stalled
2023-09-11T09:35:04-04:00 WARNING - PT1H stall timer starts NOW
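(For the record, the force trigger in both tests was just the standard CLI call, nothing special; the workflow ID below is a placeholder:)

cylc trigger <workflow-id>//1/foo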
When I try this on my original flow, but with more hosts (14) added to the platform setup, I see 4 automatic retries each time, with hosts apparently selected at random:
2023-09-11T13:03:24Z INFO - Command actioned: force_trigger_tasks(['20150107T1620Z/compute_predictions'], flow=['all'], flow_wait=False, flow_descr=None)
2023-09-11T13:03:24Z INFO - [20150107T1620Z/compute_predictions waiting job:02 flows:1] => preparing
2023-09-11T13:03:24Z WARNING - stall timer stopped
2023-09-11T13:03:25Z WARNING - platform: None - Could not connect to mustang.afrl.hpc.mil.
* mustang.afrl.hpc.mil has been added to the list of unreachable hosts
* jobs-submit will retry if another host is available.
2023-09-11T13:03:26Z WARNING - platform: None - Could not connect to mustang04.afrl.hpc.mil.
* mustang04.afrl.hpc.mil has been added to the list of unreachable hosts
* jobs-submit will retry if another host is available.
2023-09-11T13:03:27Z WARNING - platform: None - Could not connect to mustang02.afrl.hpc.mil.
* mustang02.afrl.hpc.mil has been added to the list of unreachable hosts
* jobs-submit will retry if another host is available.
2023-09-11T13:03:47Z WARNING - platform: None - Could not connect to mustang08.afrl.hpc.mil.
* mustang08.afrl.hpc.mil has been added to the list of unreachable hosts
* jobs-submit will retry if another host is available.
2023-09-11T13:03:47Z CRITICAL - [20150107T1620Z/compute_predictions preparing job:02 flows:1] submission failed
2023-09-11T13:03:47Z INFO - [20150107T1620Z/compute_predictions preparing job:02 flows:1] => submit-failed
2023-09-11T13:03:47Z ERROR - Incomplete tasks:
* 20150107T1620Z/compute_predictions did not complete required outputs: ['succeeded']
2023-09-11T13:03:47Z WARNING - Partially satisfied prerequisites:
* 20150108T1200Z/compute_predictions is waiting on ['20150108T1155Z/advance_member11:succeeded', '20150108T1155Z/advance_member10:succeeded', '20150108T1155Z/advance_member05:succeeded', '20150108T1155Z/advance_member04:succeeded', '20150108T1155Z/advance_member02:succeeded', '20150108T1155Z/advance_member06:succeeded', '20150108T1155Z/advance_member03:succeeded', '20150108T1155Z/advance_member08:succeeded', '20150108T1155Z/advance_member01:succeeded', '20150108T1155Z/advance_member15:succeeded', '20150108T1155Z/advance_member12:succeeded', '20150108T1155Z/advance_member09:succeeded', '20150108T1155Z/advance_member07:succeeded', '20150108T1155Z/advance_member13:succeeded', '20150108T1155Z/advance_member14:succeeded']
2023-09-11T13:03:47Z CRITICAL - Workflow stalled
2023-09-11T13:03:47Z WARNING - PT1H stall timer starts NOW
So apparently there is a limit on the number of retries (only 4 of the 14 hosts are tried before the task goes to submit-failed), but I do see automatic retries as expected.
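Side note on the host ordering: if I'm reading the docs right, the per-platform host selection method defaults to random, and pinning it to definition order might make runs like this easier to compare. Something like the following in global.cylc (platform name is my guess and the host list is abbreviated):

[platforms]
    [[mustang]]
        hosts = mustang.afrl.hpc.mil, mustang02.afrl.hpc.mil  # ...plus the other hosts
        [[[selection]]]
            method = definition order  # the default is random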
I noticed another quirk while testing this, which I'm not sure how to troubleshoot but which doesn't seem like expected behavior. As I said above, if the flow stalls because of failed submissions after I manually destroy my login credentials, I'm able to get it un-stalled by manually triggering tasks, as expected. However, that doesn't work if the flow stalls because the login credentials expire on their own (i.e. without me destroying them manually). In that case manual triggering does nothing, even after I renew the credentials. I can get it running again by stopping and restarting the scheduler, but from what you said above that doesn't sound like expected behavior either.
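For completeness, the workaround that gets things moving again after the credentials are renewed is a plain stop/restart (workflow ID is a placeholder):

cylc stop <workflow-id>
cylc play <workflow-id>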