Hi
We’ve recently encountered an intermittent issue (now occurring perhaps a couple of times a week) where polling a task has incorrectly returned submit-failed or failed while the task is still in the queue or running (Slurm). This is happening in three long-running suites using cylc-7.8.7.
I found this topic - Succeeded jobs polling as failed - however I don’t think it is relevant, as the task(s) being polled have not changed state.
Output from log/suite/log:
2022-09-24T15:46:03Z INFO - [atmos_main.20000601T0000Z] -submit-num=01, owner@host=login.archer2.ac.uk
2022-09-24T15:46:03Z INFO - [rose_arch_pe_output.20000501T0000Z] -submit-num=01, owner@host=login.archer2.ac.uk
2022-09-24T15:46:03Z INFO - [postproc.20000501T0000Z] -submit-num=01, owner@host=login.archer2.ac.uk
2022-09-24T15:46:06Z INFO - [atmos_main.20000601T0000Z] status=ready: (internal)submitted at 2022-09-24T15:46:05Z for job(01)
2022-09-24T15:46:06Z INFO - [atmos_main.20000601T0000Z] -health check settings: submission timeout=None, polling intervals=PT5M,...
2022-09-24T15:46:06Z INFO - [postproc.20000501T0000Z] status=ready: (internal)submitted at 2022-09-24T15:46:05Z for job(01)
2022-09-24T15:46:06Z INFO - [postproc.20000501T0000Z] -health check settings: submission timeout=None, polling intervals=PT5M,...
2022-09-24T15:46:06Z INFO - [rose_arch_pe_output.20000501T0000Z] status=ready: (internal)submitted at 2022-09-24T15:46:05Z for job(01)
2022-09-24T15:46:06Z INFO - [rose_arch_pe_output.20000501T0000Z] -health check settings: submission timeout=None, polling intervals=PT5M,...
2022-09-24T15:51:06Z INFO - [atmos_main.20000601T0000Z] -poll now, (next in PT5M (after 2022-09-24T15:56:06Z))
2022-09-24T15:51:06Z INFO - [rose_arch_pe_output.20000501T0000Z] -poll now, (next in PT5M (after 2022-09-24T15:56:06Z))
2022-09-24T15:51:06Z INFO - [postproc.20000501T0000Z] -poll now, (next in PT5M (after 2022-09-24T15:56:06Z))
2022-09-24T15:52:43Z INFO - [atmos_main.20000601T0000Z] status=submitted: (polled)submission failed at 2022-09-24T15:46:05Z for job(01)
2022-09-24T15:52:43Z ERROR - [atmos_main.20000601T0000Z] -submission failed
2022-09-24T15:52:43Z INFO - [postproc.20000501T0000Z] status=submitted: (polled)failed at 2022-09-24T15:52:43Z for job(01)
2022-09-24T15:52:43Z CRITICAL - [postproc.20000501T0000Z] -job(01) failed
In the atmos_main.20000601T0000Z job.status file we have an incorrect CYLC_BATCH_SYS_EXIT_POLLED, and the job goes on to start running and succeed!
CYLC_BATCH_SYS_NAME=slurm
CYLC_BATCH_SYS_JOB_ID=2361100
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-09-24T15:46:05Z
CYLC_BATCH_SYS_EXIT_POLLED=2022-09-24T15:52:43Z
CYLC_JOB_PID=34385
CYLC_JOB_INIT_TIME=2022-09-24T22:04:52Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-09-25T03:47:35Z
Similarly for the postproc task:
CYLC_BATCH_SYS_NAME=slurm
CYLC_BATCH_SYS_JOB_ID=2361101
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-09-24T15:46:05Z
CYLC_JOB_PID=93086
CYLC_JOB_INIT_TIME=2022-09-24T15:46:14Z
CYLC_BATCH_SYS_EXIT_POLLED=2022-09-24T15:52:43Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-09-24T16:40:52Z
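For what it’s worth, my rough mental model of the failure (and this is only an illustrative sketch on my part, not the actual cylc code) is that the poll asks Slurm which of the submitted job IDs are still in the queue, treats anything missing from that answer as having exited, and then falls back on job.status to decide the task state. If that’s anywhere near right, a transient squeue/controller/connection hiccup that returns nothing would make a perfectly healthy job look as though it had vanished:

```python
# Illustrative sketch only -- my guess at the shape of the poll logic, not the
# real cylc-7 code.  The function and variable names here are mine.
import subprocess

def slurm_jobs_still_queued(job_ids):
    """Return the set of job IDs that squeue still knows about."""
    proc = subprocess.run(
        ["squeue", "-h", "-o", "%i", "-j", ",".join(job_ids)],
        capture_output=True, text=True)
    # If squeue fails transiently (controller/connection problem) stdout is
    # empty, so every polled job looks as though it has left the queue.
    return set(proc.stdout.split())

def interpret_poll(job_id, still_queued, job_status):
    """Guess at how a poll result might turn into a task state."""
    if job_id in still_queued:
        return "still submitted/running"
    # Job apparently gone from the batch system: fall back on job.status.
    if "CYLC_JOB_EXIT" in job_status:
        return job_status["CYLC_JOB_EXIT"].lower()  # succeeded/failed
    if "CYLC_JOB_INIT_TIME" in job_status:
        return "failed"             # started but never recorded an exit
    return "submission failed"      # never even started
```

Again, purely illustrative - I haven’t confirmed that this is what the cylc-7 poll actually does.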
I’m guessing it’s an issue at the ARCHER2 end or a connection issue: before around July/August it was running fine with cylc-7.8.7. I’ve also noticed that if the suites happen to be polling at exactly the same time and one suffers this issue, they both do, which is at least consistent. I need to get more information out of the cylc polling to figure out exactly what is going on - assuming I can recreate the problem with a test suite!

The polling code is a bit of a minefield; I’m trying to put some debug statements in jobs_poll and _jobs_poll_batch_sys within batch_sys_manager.py but failing dismally again. I tried LOG.info and sys.stdout.write
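For reference, this is roughly the kind of thing I’ve been trying to add to batch_sys_manager.py. It’s based on a guess (which may well be wrong) that the poll runs as its own "cylc jobs-poll" process on the job host, so LOG.info (suite-side logging) goes nowhere useful and extra lines on stdout risk confusing whatever parses the poll output - which is why I’m now wondering whether stderr is the right channel. The _debug helper name is mine:

```python
# A guess on my part, not from the cylc docs: since jobs_poll appears to run
# as a separate "cylc jobs-poll" process on the job host, suite-side logging
# and plain prints to stdout may never reach the suite log, and stdout looks
# like structured output that something else parses.  So trying stderr.
import sys

def _debug(msg):
    sys.stderr.write("POLL DEBUG: %s\n" % msg)
    sys.stderr.flush()

# e.g. dropped in near the top of _jobs_poll_batch_sys():
#     _debug("entering _jobs_poll_batch_sys")
```

I haven’t yet confirmed where (if anywhere) that stderr output ends up, hence the questions below.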
with no luck. First question: is this the correct place to be looking and, if so, how should I be writing out debug statements? Secondly, if you have any suggestions as to what could be going on, that would be great.
Thanks in advance.
Cheers,
Ros.