I am having an issue where, if job submission to a host fails, it does not try any of the other hosts in the platform group. This seems similar to: Reset bad hosts list from CLI? - #12 by wxtim
If I try this with some fake hosts (as in Hilary’s test), it does seem to work as expected.
With global.conf:
[platforms]
    [[nein, nada, zippo]]
        install target = dummy
[platform groups]
    [[fake]]
        platforms = nein, nada, zippo
Test workflow:
[scheduling]
    [[graph]]
        R1 = "foo"
[runtime]
    [[foo]]
        platform = fake
The log/scheduler/log looks like this:
2024-09-11T15:04:10+01:00 INFO - platform: zippo - remote init (on zippo)
2024-09-11T15:04:11+01:00 WARNING - platform: zippo - Could not connect to zippo.
    * zippo has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
2024-09-11T15:04:11+01:00 INFO - platform: nada - remote init (on nada)
2024-09-11T15:04:12+01:00 WARNING - platform: nada - Could not connect to nada.
    * nada has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
2024-09-11T15:04:12+01:00 INFO - platform: nein - remote init (on nein)
2024-09-11T15:04:13+01:00 WARNING - platform: nein - Could not connect to nein.
    * nein has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
2024-09-11T15:04:13+01:00 ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1
2024-09-11T15:04:13+01:00 CRITICAL - [1/foo/01:preparing] submission failed
2024-09-11T15:04:13+01:00 INFO - [1/foo/01:preparing] => submit-failed
However, if I run a simple test with a subset of the real hosts, it does not try them all.
global.conf:
[platforms]
    [[sci4.jasmin.ac.uk, sci5.jasmin.ac.uk, sci6.jasmin.ac.uk]]
        install target = jasmin
[platform groups]
    [[sci_bg]]
        platforms = sci4.jasmin.ac.uk, sci5.jasmin.ac.uk, sci6.jasmin.ac.uk
In this case, submission to sci4 and sci5 will work, but sci6 is not available at the moment, so submission to it will fail.
My test workflow:
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    final cycle point = 10
    [[graph]]
        P1 = "foo[-P1] => foo"
[runtime]
    [[foo]]
        platform = sci_bg
Then the log/scheduler/log file looks like this:
2024-09-11T18:28:27+01:00 INFO - Final point: 10
2024-09-11T18:28:27+01:00 INFO - Cold start from 1
2024-09-11T18:28:27+01:00 INFO - New flow: 1 (original flow from 1) 2024-09-11T18:28:27
2024-09-11T18:28:27+01:00 INFO - [1/foo:waiting(runahead)] => waiting
2024-09-11T18:28:27+01:00 INFO - [1/foo:waiting] => waiting(queued)
2024-09-11T18:28:27+01:00 INFO - [1/foo:waiting(queued)] => waiting
2024-09-11T18:28:27+01:00 INFO - [1/foo:waiting] => preparing
2024-09-11T18:28:27+01:00 INFO - platform: sci4.jasmin.ac.uk - remote init (on sci4.jasmin.ac.uk)
2024-09-11T18:28:30+01:00 INFO - platform: sci4.jasmin.ac.uk - remote file install (on sci4.jasmin.ac.uk)
2024-09-11T18:28:31+01:00 INFO - platform: sci4.jasmin.ac.uk - remote file install complete
2024-09-11T18:28:33+01:00 INFO - [1/foo/01:preparing] submitted to sci4.jasmin.ac.uk:background[23420]
2024-09-11T18:28:33+01:00 INFO - [1/foo/01:preparing] => submitted
2024-09-11T18:29:54+01:00 INFO - Command "poll_tasks" received. ID=96cc4840-f011-4a6b-bdbe-036480928611
    poll_tasks(tasks=['1/*'])
2024-09-11T18:29:54+01:00 INFO - Command "poll_tasks" actioned. ID=96cc4840-f011-4a6b-bdbe-036480928611
2024-09-11T18:29:56+01:00 INFO - [1/foo/01:submitted] (polled)succeeded
2024-09-11T18:29:56+01:00 INFO - [1/foo/01:submitted] setting implied output: started
2024-09-11T18:29:56+01:00 INFO - [1/foo/01:submitted] => running
2024-09-11T18:29:56+01:00 INFO - [1/foo/01:running] => succeeded
2024-09-11T18:29:56+01:00 INFO - [2/foo:waiting(runahead)] => waiting
2024-09-11T18:29:56+01:00 INFO - [2/foo:waiting] => waiting(queued)
2024-09-11T18:29:56+01:00 INFO - [2/foo:waiting(queued)] => waiting
2024-09-11T18:29:56+01:00 INFO - [2/foo:waiting] => preparing
2024-09-11T18:30:07+01:00 WARNING - platform: sci_bg - Could not connect to sci6.jasmin.ac.uk.
    * sci6.jasmin.ac.uk has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2024-09-11T18:30:07+01:00 CRITICAL - [2/foo/01:preparing] submission failed
2024-09-11T18:30:07+01:00 INFO - [2/foo/01:preparing] => submit-failed
2024-09-11T18:30:07+01:00 WARNING - [2/foo/01:submit-failed] did not complete the required outputs:
    ⨯ ⦙ succeeded
2024-09-11T18:30:07+01:00 ERROR - Incomplete tasks:
    * 2/foo did not complete the required outputs:
        ⨯ ⦙ succeeded
2024-09-11T18:30:07+01:00 CRITICAL - Workflow stalled
2024-09-11T18:30:07+01:00 WARNING - PT1H stall timer starts NOW
Why doesn’t it try any other hosts? Does it depend on the reason for the failure? Am I missing something here?
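To make the question concrete, this is roughly the behaviour I expected from the "will retry if another host is available" message. It is only a sketch of my expectation, not Cylc's actual implementation: submit_with_fallback, try_submit and the unreachable set are hypothetical names made up for illustration.

```python
from typing import Callable, Iterable, Optional, Set


def submit_with_fallback(
    hosts: Iterable[str],
    try_submit: Callable[[str], bool],
    unreachable: Set[str],
) -> Optional[str]:
    """Return the first host that accepts the job, or None if all fail.

    Hypothetical illustration of the expected fallback; not Cylc's code.
    """
    for host in hosts:
        if host in unreachable:
            continue  # skip hosts already recorded as unreachable
        if try_submit(host):
            return host
        unreachable.add(host)  # record the failure, then try the next host
    return None


# Mirrors the second test: sci6 is down, sci4 and sci5 are up.
group = ["sci6.jasmin.ac.uk", "sci4.jasmin.ac.uk", "sci5.jasmin.ac.uk"]
bad_hosts: Set[str] = set()
chosen = submit_with_fallback(
    group, lambda host: not host.startswith("sci6"), bad_hosts
)
print(chosen)
```

Under that expectation, 2/foo would have been submitted to sci4 or sci5 after sci6 was added to the unreachable list, and it would only become submit-failed once every host in the group had been tried, as happens in the fake-host test.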