Polling reporting submit-failed/failed when jobs still in queue

Hi

We’ve recently encountered an intermittent issue (occurring perhaps a couple of times a week) where polling a task incorrectly returns submit-failed or failed while the task is still queued or running under Slurm. This is happening in 3 long-running suites using cylc-7.8.7.

I found this topic - Succeeded jobs polling as failed - however I don’t think it is relevant here, as the task(s) being polled have not changed state.

Output from log/suite/log:

2022-09-24T15:46:03Z INFO - [atmos_main.20000601T0000Z] -submit-num=01, owner@host=login.archer2.ac.uk
2022-09-24T15:46:03Z INFO - [rose_arch_pe_output.20000501T0000Z] -submit-num=01, owner@host=login.archer2.ac.uk
2022-09-24T15:46:03Z INFO - [postproc.20000501T0000Z] -submit-num=01, owner@host=login.archer2.ac.uk
2022-09-24T15:46:06Z INFO - [atmos_main.20000601T0000Z] status=ready: (internal)submitted at 2022-09-24T15:46:05Z for job(01)
2022-09-24T15:46:06Z INFO - [atmos_main.20000601T0000Z] -health check settings: submission timeout=None, polling intervals=PT5M,...
2022-09-24T15:46:06Z INFO - [postproc.20000501T0000Z] status=ready: (internal)submitted at 2022-09-24T15:46:05Z for job(01)
2022-09-24T15:46:06Z INFO - [postproc.20000501T0000Z] -health check settings: submission timeout=None, polling intervals=PT5M,...
2022-09-24T15:46:06Z INFO - [rose_arch_pe_output.20000501T0000Z] status=ready: (internal)submitted at 2022-09-24T15:46:05Z for job(01)
2022-09-24T15:46:06Z INFO - [rose_arch_pe_output.20000501T0000Z] -health check settings: submission timeout=None, polling intervals=PT5M,...
2022-09-24T15:51:06Z INFO - [atmos_main.20000601T0000Z] -poll now, (next in PT5M (after 2022-09-24T15:56:06Z))
2022-09-24T15:51:06Z INFO - [rose_arch_pe_output.20000501T0000Z] -poll now, (next in PT5M (after 2022-09-24T15:56:06Z))
2022-09-24T15:51:06Z INFO - [postproc.20000501T0000Z] -poll now, (next in PT5M (after 2022-09-24T15:56:06Z))
2022-09-24T15:52:43Z INFO - [atmos_main.20000601T0000Z] status=submitted: (polled)submission failed at 2022-09-24T15:46:05Z for job(01)
2022-09-24T15:52:43Z ERROR - [atmos_main.20000601T0000Z] -submission failed
2022-09-24T15:52:43Z INFO - [postproc.20000501T0000Z] status=submitted: (polled)failed at 2022-09-24T15:52:43Z for job(01)
2022-09-24T15:52:43Z CRITICAL - [postproc.20000501T0000Z] -job(01) failed

In the atmos_main.20000601T0000Z job.status file we have an incorrect CYLC_BATCH_SYS_EXIT_POLLED entry, and the job goes on to start running and succeed!

CYLC_BATCH_SYS_NAME=slurm
CYLC_BATCH_SYS_JOB_ID=2361100
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-09-24T15:46:05Z
CYLC_BATCH_SYS_EXIT_POLLED=2022-09-24T15:52:43Z
CYLC_JOB_PID=34385
CYLC_JOB_INIT_TIME=2022-09-24T22:04:52Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-09-25T03:47:35Z

Similarly for the postproc task:

CYLC_BATCH_SYS_NAME=slurm
CYLC_BATCH_SYS_JOB_ID=2361101
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-09-24T15:46:05Z
CYLC_JOB_PID=93086
CYLC_JOB_INIT_TIME=2022-09-24T15:46:14Z
CYLC_BATCH_SYS_EXIT_POLLED=2022-09-24T15:52:43Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-09-24T16:40:52Z
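For what it’s worth, the inconsistency in these status files can be spotted mechanically: in the atmos_main case the exit-polled timestamp precedes the job’s init time. A minimal sketch that parses the key=value lines and flags that condition (the helper names here are made up for illustration, not part of cylc):

```python
from datetime import datetime

def parse_job_status(text):
    """Parse a cylc-7 job.status file into a dict of its KEY=VALUE lines."""
    status = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            status[key.strip()] = value.strip()
    return status

def poll_predates_init(status):
    """Return True if CYLC_BATCH_SYS_EXIT_POLLED was recorded before the
    job actually started (CYLC_JOB_INIT_TIME) - i.e. the poll wrongly
    concluded the job had already left the queue."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    polled = datetime.strptime(status["CYLC_BATCH_SYS_EXIT_POLLED"], fmt)
    init = datetime.strptime(status["CYLC_JOB_INIT_TIME"], fmt)
    return polled < init

# The atmos_main case from above: polled 15:52, init 22:04 the same day.
SAMPLE = """\
CYLC_BATCH_SYS_NAME=slurm
CYLC_BATCH_SYS_JOB_ID=2361100
CYLC_BATCH_SYS_EXIT_POLLED=2022-09-24T15:52:43Z
CYLC_JOB_INIT_TIME=2022-09-24T22:04:52Z
CYLC_JOB_EXIT=SUCCEEDED
"""

print(poll_predates_init(parse_job_status(SAMPLE)))  # True
```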

I’m guessing it’s an issue at the ARCHER2 end or a connection issue; before July/August it was running fine with cylc-7.8.7. I’ve also noticed that if the suites happen to be polling at exactly the same time, and one suffers this issue, they both do - which is at least consistent.

I need to get more information out of the cylc polling to figure out exactly what is going on - that’s assuming I can recreate the problem with a test suite! :grimacing: The polling code is a bit of a minefield; I’m trying to put some debug statements in jobs_poll and _jobs_poll_batch_sys within batch_sys_manager.py but failing dismally. I tried LOG.info and sys.stdout.write with no luck.

First question: is this the correct place to be looking, and if so, how should I write out debug statements? Secondly, if you have any suggestions as to what could be going on, that would be great. :smiley: :pray:

Thanks in advance.
Cheers,
Ros.

The job-activity.log file for this job contains the following:

[jobs-poll out] 2022-09-24T15:52:43Z|20000601T0000Z/atmos_main/01|{"batch_sys_name": "slurm", "batch_sys_job_id": "2361100", "batch_sys_exit_polled": 1, "time_submit_exit": "2022-09-24T15:46:05Z"}

(I have access to the system where this is running.)

"batch_sys_exit_polled": 1 implies that the job was no longer in the Slurm queue. The question is why? I’m going to work with Ros to investigate further …
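One place the “why” can hide is in how the poll interprets the queue listing: if the job id is simply absent from the output, the poll concludes the job has exited, but a transient squeue or connection failure can look exactly like an empty listing unless the command’s return code is also checked. A hypothetical classifier (illustration only, not the actual cylc polling code) showing the distinction:

```python
def classify_squeue_result(returncode, stdout, job_id):
    """Interpret the result of something like `squeue -h -j <job_id>`.

    Distinguishes "job genuinely left the queue" from "squeue itself
    failed" - two cases a naive emptiness check conflates. Hypothetical
    helper, not the actual cylc code.
    """
    if returncode != 0:
        # squeue errored (controller busy, ssh glitch, ...): we have
        # learned nothing about the job's state.
        return "unknown"
    for line in stdout.splitlines():
        fields = line.split()
        if fields and fields[0] == str(job_id):
            return "in-queue"
    return "exited"

print(classify_squeue_result(0, "2361100 standard atmos ros R", 2361100))  # in-queue
print(classify_squeue_result(0, "", 2361100))                              # exited
print(classify_squeue_result(1, "", 2361100))                              # unknown
```

If the polling path treats "unknown" the same as "exited", an intermittent squeue or login-node failure would produce exactly the symptoms above.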

Probably explained by this known bug: https://github.com/cylc/cylc-flow/issues/3436