Submission failed but not trying other hosts in platform group

I am having an issue where, if job submission to a host fails, Cylc will not try any of the other hosts in the platform group. This seems similar to: Reset bad hosts list from CLI? - #12 by wxtim

If I try this with some fake hosts (as in Hilary’s test), it does seem to work as expected.
With global.conf:

[platforms]
    [[nein, nada, zippo]]
        install target = dummy
[platform groups]
    [[fake]]
        platforms = nein, nada, zippo

Test workflow:

[scheduling]
   [[graph]]
      R1 = "foo"
[runtime]
   [[foo]]
      platform = fake

The log/scheduler/log looks like this:

2024-09-11T15:04:10+01:00 INFO - platform: zippo - remote init (on zippo)
2024-09-11T15:04:11+01:00 WARNING - platform: zippo - Could not connect to zippo.
    * zippo has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
2024-09-11T15:04:11+01:00 INFO - platform: nada - remote init (on nada)
2024-09-11T15:04:12+01:00 WARNING - platform: nada - Could not connect to nada.
    * nada has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
2024-09-11T15:04:12+01:00 INFO - platform: nein - remote init (on nein)
2024-09-11T15:04:13+01:00 WARNING - platform: nein - Could not connect to nein.
    * nein has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
2024-09-11T15:04:13+01:00 ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1
2024-09-11T15:04:13+01:00 CRITICAL - [1/foo/01:preparing] submission failed
2024-09-11T15:04:13+01:00 INFO - [1/foo/01:preparing] => submit-failed
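
The fallback behaviour seen in that log (try each host in the group, add unreachable ones to a bad-hosts list, and give up only when the whole group is exhausted) can be sketched in Python. This is an illustrative model only, not the Cylc implementation; the `submit_with_fallback` function and the `try_submit` callback are hypothetical names:

```python
# Illustrative model of platform-group fallback (NOT Cylc source code):
# attempt each host in turn, record unreachable hosts, and give up only
# once every host in the group has been tried.

def submit_with_fallback(group, bad_hosts, try_submit):
    """Try each host not already marked bad; return the host that worked."""
    for host in group:
        if host in bad_hosts:
            continue  # skip hosts already known to be unreachable
        if try_submit(host):
            return host
        # mirrors the log line:
        # "* <host> has been added to the list of unreachable hosts"
        bad_hosts.add(host)
    raise RuntimeError("remote init failed: no hosts available")  # => submit-failed

# With three fake, unreachable hosts (as in the test above), every host
# ends up on the bad list and the submission fails.
bad_hosts = set()
try:
    submit_with_fallback(["nein", "nada", "zippo"], bad_hosts, lambda host: False)
except RuntimeError as exc:
    print(exc)             # remote init failed: no hosts available
print(sorted(bad_hosts))   # ['nada', 'nein', 'zippo']
```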

However, if I run a simple test with a subset of the real hosts, it does not try them all.
global.conf:

[platforms]
    [[sci4.jasmin.ac.uk, sci5.jasmin.ac.uk, sci6.jasmin.ac.uk]]
        install target = jasmin
[platform groups]
    [[sci_bg]]
        platforms = sci4.jasmin.ac.uk, sci5.jasmin.ac.uk, sci6.jasmin.ac.uk

In this case submission to sci4 and sci5 will work, but sci6 is unavailable at the moment, so submission to it will fail.

My test workflow:

[scheduling]
   cycling mode = integer
   initial cycle point = 1
   final cycle point = 10
   [[graph]]
      P1 = "foo[-P1] => foo"
[runtime]
   [[foo]]
      platform = sci_bg

Then the log/scheduler/log file looks like this:

2024-09-11T18:28:27+01:00 INFO - Final point: 10
2024-09-11T18:28:27+01:00 INFO - Cold start from 1
2024-09-11T18:28:27+01:00 INFO - New flow: 1 (original flow from 1) 2024-09-11T18:28:27
2024-09-11T18:28:27+01:00 INFO - [1/foo:waiting(runahead)] => waiting
2024-09-11T18:28:27+01:00 INFO - [1/foo:waiting] => waiting(queued)
2024-09-11T18:28:27+01:00 INFO - [1/foo:waiting(queued)] => waiting
2024-09-11T18:28:27+01:00 INFO - [1/foo:waiting] => preparing
2024-09-11T18:28:27+01:00 INFO - platform: sci4.jasmin.ac.uk - remote init (on sci4.jasmin.ac.uk)
2024-09-11T18:28:30+01:00 INFO - platform: sci4.jasmin.ac.uk - remote file install (on sci4.jasmin.ac.uk)
2024-09-11T18:28:31+01:00 INFO - platform: sci4.jasmin.ac.uk - remote file install complete
2024-09-11T18:28:33+01:00 INFO - [1/foo/01:preparing] submitted to sci4.jasmin.ac.uk:background[23420]
2024-09-11T18:28:33+01:00 INFO - [1/foo/01:preparing] => submitted
2024-09-11T18:29:54+01:00 INFO - Command "poll_tasks" received. ID=96cc4840-f011-4a6b-bdbe-036480928611
    poll_tasks(tasks=['1/*'])
2024-09-11T18:29:54+01:00 INFO - Command "poll_tasks" actioned. ID=96cc4840-f011-4a6b-bdbe-036480928611
2024-09-11T18:29:56+01:00 INFO - [1/foo/01:submitted] (polled)succeeded
2024-09-11T18:29:56+01:00 INFO - [1/foo/01:submitted] setting implied output: started
2024-09-11T18:29:56+01:00 INFO - [1/foo/01:submitted] => running
2024-09-11T18:29:56+01:00 INFO - [1/foo/01:running] => succeeded
2024-09-11T18:29:56+01:00 INFO - [2/foo:waiting(runahead)] => waiting
2024-09-11T18:29:56+01:00 INFO - [2/foo:waiting] => waiting(queued)
2024-09-11T18:29:56+01:00 INFO - [2/foo:waiting(queued)] => waiting
2024-09-11T18:29:56+01:00 INFO - [2/foo:waiting] => preparing
2024-09-11T18:30:07+01:00 WARNING - platform: sci_bg - Could not connect to sci6.jasmin.ac.uk.
    * sci6.jasmin.ac.uk has been added to the list of unreachable hosts
    * jobs-submit will retry if another host is available.
2024-09-11T18:30:07+01:00 CRITICAL - [2/foo/01:preparing] submission failed
2024-09-11T18:30:07+01:00 INFO - [2/foo/01:preparing] => submit-failed
2024-09-11T18:30:07+01:00 WARNING - [2/foo/01:submit-failed] did not complete the required outputs:
    ⨯ ⦙  succeeded
2024-09-11T18:30:07+01:00 ERROR - Incomplete tasks:
    * 2/foo did not complete the required outputs:
      ⨯ ⦙  succeeded
2024-09-11T18:30:07+01:00 CRITICAL - Workflow stalled
2024-09-11T18:30:07+01:00 WARNING - PT1H stall timer starts NOW

Why doesn’t it try any other hosts? Does it depend on the reason for the failure? Am I missing something here?

(Hi @annette.osprey - that does look like a bug, but at first cut I’ll leave it to the Platforms experts at the UK end).

Hi, could you include which version of Cylc you are using, and try running with --debug so we can see what platform error you’re getting? Thanks

This is cylc 8.3.3. Running with --debug gives this:

2024-09-12T11:23:00+01:00 DEBUG - Starting
2024-09-12T11:23:00+01:00 DEBUG - Configure curve: *[/home/n02/n02/annette_test/cylc-run/test_jasmin2/run4/.service/client_public_keys]
2024-09-12T11:23:00+01:00 INFO - Workflow: test_jasmin2/run4
2024-09-12T11:23:00+01:00 DEBUG - Reading file /home/n02/n02/annette_test/cylc-run/test_jasmin2/run4/flow.cylc
2024-09-12T11:23:00+01:00 DEBUG - ran rose in 0.00019s
2024-09-12T11:23:00+01:00 DEBUG - Processed configuration dumped: /home/n02/n02/annette_test/cylc-run/test_jasmin2/run4/log/config/flow-processed.cylc
2024-09-12T11:23:01+01:00 DEBUG - Expanding [runtime] namespace lists and parameters
2024-09-12T11:23:01+01:00 DEBUG - Parsing the runtime namespace hierarchy
2024-09-12T11:23:01+01:00 DEBUG - Parsing [special tasks]
2024-09-12T11:23:01+01:00 DEBUG - Parsing the dependency graph
2024-09-12T11:23:01+01:00 INFO - Scheduler: url=tcp://puma2.archer2.ac.uk:43073 pid=4155112
2024-09-12T11:23:01+01:00 INFO - Workflow publisher: url=tcp://puma2.archer2.ac.uk:43001
2024-09-12T11:23:01+01:00 INFO - Run: (re)start number=1, log rollover=1
2024-09-12T11:23:01+01:00 INFO - Cylc version: 8.3.3
2024-09-12T11:23:01+01:00 INFO - Run mode: live
2024-09-12T11:23:01+01:00 INFO - Initial point: 1
2024-09-12T11:23:01+01:00 INFO - Final point: 10
2024-09-12T11:23:01+01:00 INFO - Cold start from 1
2024-09-12T11:23:01+01:00 INFO - New flow: 1 (original flow from 1) 2024-09-12T11:23:01
2024-09-12T11:23:01+01:00 DEBUG - Runahead: base point 1
2024-09-12T11:23:01+01:00 DEBUG - Runahead limit: 5
2024-09-12T11:23:01+01:00 DEBUG - [1/foo:waiting(runahead)] added to active task pool
2024-09-12T11:23:01+01:00 INFO - [1/foo:waiting(runahead)] => waiting
2024-09-12T11:23:01+01:00 INFO - [1/foo:waiting] => waiting(queued)
2024-09-12T11:23:01+01:00 DEBUG - Loaded main loop plugin "health check":
    * health_check
2024-09-12T11:23:01+01:00 DEBUG - Loaded main loop plugin "reset bad hosts":
    * reset_bad_hosts
2024-09-12T11:23:01+01:00 INFO - [1/foo:waiting(queued)] => waiting
2024-09-12T11:23:01+01:00 INFO - [1/foo:waiting] => preparing
2024-09-12T11:23:01+01:00 INFO - platform: sci6.jasmin.ac.uk - remote init (on sci6.jasmin.ac.uk)
2024-09-12T11:23:01+01:00 DEBUG - main_loop [run] cylc.flow.main_loop.health_check:health_check
2024-09-12T11:23:01+01:00 DEBUG - main_loop [end] cylc.flow.main_loop.health_check:health_check (0.001s)
2024-09-12T11:23:01+01:00 DEBUG - main_loop [run] cylc.flow.main_loop.reset_bad_hosts:reset_bad_hosts
2024-09-12T11:23:01+01:00 DEBUG - main_loop [end] cylc.flow.main_loop.reset_bad_hosts:reset_bad_hosts (0.000s)
2024-09-12T11:23:01+01:00 DEBUG - ['ssh', '-oBatchMode=yes', '-oConnectTimeout=10', 'sci6.jasmin.ac.uk', 'env', 'CYLC_VERSION=8.3.3', 'CYLC_CONF_PATH=/home/n02/n02/annette/debug/jasmin', 'CYLC_ENV_NAME=cylc-8.3.3-1', 'bash', '--login', '-c', '\'exec "$0" "$@"\'', 'cylc', 'remote-init', '-v', '-v', 'jasmin', '$HOME/cylc-run/test_jasmin2/run4']
2024-09-12T11:23:11+01:00 DEBUG - [remote-init cmd] cat '<tempfile.SpooledTemporaryFile object at 0x7f21e0064910>' | ssh -oBatchMode=yes -oConnectTimeout=10 sci6.jasmin.ac.uk env CYLC_VERSION=8.3.3 CYLC_CONF_PATH=/home/n02/n02/annette/debug/jasmin CYLC_ENV_NAME=cylc-8.3.3-1 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc remote-init -v -v jasmin '$HOME/cylc-run/test_jasmin2/run4'
    [remote-init ret_code] 255
    [remote-init err]
    Access to this system is monitored and restricted to authorised users. If you
    do not have authorisation to use this system, you should not proceed beyond
    this point and should disconnect immediately.
    Unauthorised use could lead to prosecution.
    channel 0: open failed: connect failed: No route to host
    stdio forwarding failed
    kex_exchange_identification: Connection closed by remote host
2024-09-12T11:23:11+01:00 WARNING - platform: sci6.jasmin.ac.uk - Could not connect to sci6.jasmin.ac.uk.
    * sci6.jasmin.ac.uk has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
2024-09-12T11:23:11+01:00 ERROR - platform: sci6.jasmin.ac.uk - initialisation did not complete
    COMMAND:
        ssh -oBatchMode=yes -oConnectTimeout=10 sci6.jasmin.ac.uk \
            env CYLC_VERSION=8.3.3 \
            CYLC_CONF_PATH=/home/n02/n02/annette/debug/jasmin \
            CYLC_ENV_NAME=cylc-8.3.3-1 bash --login -c \
            'exec "$0" "$@"' cylc remote-init -v -v jasmin \
            $HOME/cylc-run/test_jasmin2/run4
    RETURN CODE:
        255
    STDERR:
        Access to this system is monitored and restricted to authorised users. If you
        do not have authorisation to use this system, you should not proceed beyond
        this point and should disconnect immediately.
        Unauthorised use could lead to prosecution.
        channel 0: open failed: connect failed: No route to host
        stdio forwarding failed
        kex_exchange_identification: Connection closed by remote host
2024-09-12T11:23:11+01:00 DEBUG - [1/foo/01:preparing] host=sci4.jasmin.ac.uk
2024-09-12T11:23:11+01:00 ERROR - [jobs-submit cmd] (init sci4.jasmin.ac.uk)
    [jobs-submit ret_code] 1
    [jobs-submit err] REMOTE INIT FAILED
2024-09-12T11:23:11+01:00 ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1
2024-09-12T11:23:11+01:00 DEBUG - [1/foo/01:preparing] (internal)submission failed
2024-09-12T11:23:11+01:00 CRITICAL - [1/foo/01:preparing] submission failed
2024-09-12T11:23:11+01:00 INFO - [1/foo/01:preparing] => submit-failed
2024-09-12T11:23:11+01:00 WARNING - [1/foo/01:submit-failed] did not complete the required outputs:
    ⨯ ⦙  succeeded
2024-09-12T11:23:11+01:00 DEBUG - 1/foo -triggered off ['0/foo'] in flow 1
2024-09-12T11:23:11+01:00 ERROR - Incomplete tasks:
    * 1/foo did not complete the required outputs:
      ⨯ ⦙  succeeded
2024-09-12T11:23:11+01:00 CRITICAL - Workflow stalled
2024-09-12T11:23:11+01:00 WARNING - PT1H stall timer starts NOW

Thanks, it looks like a bug. I’m afraid most of our team is on leave, so it may take a while to investigate properly.

These lines are interesting:

INFO - [1/foo:waiting] => preparing
INFO - platform: sci6.jasmin.ac.uk - remote init (on sci6.jasmin.ac.uk)
ERROR - [jobs-submit cmd] (init sci4.jasmin.ac.uk)

Cylc tried to submit a job on sci6, but later it reported that job submission failed on sci4. So how did we get from server 6 to server 4?
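
One way to picture the symptom (a toy sketch only, not Cylc internals, and not a claim about what the bug actually is): if the remote-init stage and the jobs-submit stage each selected a host from the platform group independently, the two stages could end up referring to different hosts, as the sci6/sci4 lines above appear to.

```python
import random

# Toy sketch only, NOT Cylc internals: two stages that each pick a host
# from the same platform group independently are not forced to agree.
group = ["sci4.jasmin.ac.uk", "sci5.jasmin.ac.uk", "sci6.jasmin.ac.uk"]

rng = random.Random()
init_host = rng.choice(group)    # host used for "remote init"
submit_host = rng.choice(group)  # host used later for "jobs-submit"
print(init_host, submit_host)    # nothing guarantees these match
```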

I don’t know anything about the servers in question (I don’t have Jasmin access). Taking a look at this page, I wonder whether these are truly separate servers, or whether some of them are hosted on the others? Servers 3, 5 & 8 are described as “physical”, whereas the (much smaller) 1, 2, 4, 6 & 7 are described as “virtual”. If these “virtual” servers share DNS with the “physical” servers (on which they are hosted?), then they might look to Cylc like the same thing, which could explain the issue.

The description on that page makes it sound like these servers are provisioned for interactive work rather than for running workflows on. I found the Cylc config for Jasmin; it configures a single platform called “lotus”. I don’t know if that helps?

Thanks for your reply Oliver.

Just to set the context: there is a Cylc server at Jasmin for running workflows on the lotus cluster. However, I am working on puma2, the Cylc server attached to the Archer2 HPC. The bulk of the workload is run on the HPC, but the data is then transferred to Jasmin for archiving/further processing. As part of our workflows we need to be able to run tasks on the sci nodes, for example to upload data to the tape service.

From what you’ve said, I suspect the issue is that we access all of the sci nodes via the same proxy host (a specific jasmin login node). Do you think my sci node platform group just isn’t going to work?

As far as I can see it should work fine (I don’t think DNS or proxy servers are likely to be the issue). I’ll try to reproduce.

I can reproduce the problem but, so far, I can’t work out what’s special about this case.
Investigations will continue …

Hi,

We’ve managed to narrow this down to a bug in Cylc (apologies). We aim to get this fixed for the next release (8.3.5); we’ll announce the release here when it’s made. It should be deployed on Jasmin soon after.

Cheers

That’s great thank you. I will install on puma2/archer2 when it’s available.

Annette
