Hi. I’m having issues recently where jobs successfully run, but then register as failed and are resubmitted.
I found some similar topics but none that seemed close enough to be helpful.
I’m submitting to PBS. My communication method is default, whatever that is (running at NCI), which is not “poll” I’m pretty sure. But there is some polling happening as well.
Here are the job-activity.log, job.status, and suite log lines relevant, for tries 1 and 2.
job-activity.log
[jobs-submit ret_code] 0
[jobs-submit out] 2022-02-02T03:40:57Z|20211126T0500Z/ukv_um_fcst_05/01|0|34539416.gadi-pbs
(gadi.nci.org.au) 2022-02-02T03:40:57Z [STDOUT] 34539416.gadi-pbs
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T03:48:16Z|20211126T0500Z/ukv_um_fcst_05/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "34539416.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:41:47Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T03:51:38Z|20211126T0500Z/ukv_um_fcst_05/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "34539416.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:41:47Z", "time_run": "2022-02-02T03:48:39Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T03:53:24Z|20211126T0500Z/ukv_um_fcst_05/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "34539416.gadi-pbs", "batch_sys_exit_polled": 1, "time_submit_exit": "2022-02-02T03:41:47Z", "time_run": "2022-02-02T03:48:39Z"}
[(('job-logs-retrieve', 'succeeded'), 1) ret_code] 0
[(('job-logs-retrieve', 'retry'), 1) ret_code] 0
[jobs-submit ret_code] 0
[jobs-submit out] 2022-02-02T03:54:23Z|20211126T0500Z/ukv_um_fcst_05/02|0|34540021.gadi-pbs
(gadi.nci.org.au) 2022-02-02T03:54:23Z [STDOUT] 34540021.gadi-pbs
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T04:01:37Z|20211126T0500Z/ukv_um_fcst_05/02|{"batch_sys_name": "pbs", "batch_sys_job_id": "34540021.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:54:23Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T04:03:57Z|20211126T0500Z/ukv_um_fcst_05/02|{"batch_sys_name": "pbs", "batch_sys_job_id": "34540021.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:54:23Z", "time_run": "2022-02-02T04:01:44Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T04:06:26Z|20211126T0500Z/ukv_um_fcst_05/02|{"batch_sys_name": "pbs", "batch_sys_job_id": "34540021.gadi-pbs", "run_status": 0, "time_submit_exit": "2022-02-02T03:54:23Z", "time_run": "2022-02-02T04:01:44Z", "time_run_exit": "2022-02-02T04:06:05Z"}
[(('job-logs-retrieve', 'succeeded'), 2) ret_code] 0
job.status
CYLC_BATCH_SYS_NAME=pbs
CYLC_BATCH_SYS_JOB_ID=34539416.gadi-pbs
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-02-02T03:41:47Z
CYLC_JOB_PID=3113620
CYLC_JOB_INIT_TIME=2022-02-02T03:48:39Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-02-02T03:53:06Z
CYLC_BATCH_SYS_EXIT_POLLED=2022-02-02T03:53:24Z
CYLC_BATCH_SYS_NAME=pbs
CYLC_BATCH_SYS_JOB_ID=34540021.gadi-pbs
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-02-02T03:54:23Z
CYLC_JOB_PID=3115075
CYLC_JOB_INIT_TIME=2022-02-02T04:01:44Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-02-02T04:06:05Z
suite log
2022-02-02T03:40:44Z INFO - [ukv_um_fcst_05.20211126T0500Z] -submit-num=01, owner@host=gadi.nci.org.au
2022-02-02T03:41:47Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=ready: (internal)submitted at 2022-02-02T03:40:57Z for job(01)
2022-02-02T03:41:47Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: submission timeout=P1D, polling intervals=30*PT5M,...
2022-02-02T03:46:47Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T03:51:47Z))
2022-02-02T03:48:16Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (polled)submitted at 2022-02-02T03:41:47Z for job(01)
2022-02-02T03:48:41Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (received)started at 2022-02-02T03:48:39Z for job(01)
2022-02-02T03:48:41Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: execution timeout=PT24M, polling intervals=2*PT2M,2*PT5M,PT1M,PT2M,PT7M,...
2022-02-02T03:50:41Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT2M (after 2022-02-02T03:52:41Z))
2022-02-02T03:51:38Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (polled)started at 2022-02-02T03:48:39Z for job(01)
2022-02-02T03:52:55Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T03:57:55Z))
2022-02-02T03:53:10Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (received)succeeded at 2022-02-02T03:53:06Z for job(01)
2022-02-02T03:53:26Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=succeeded: (polled)failed at 2022-02-02T03:53:26Z for job(01)
2022-02-02T03:53:26Z INFO - [ukv_um_fcst_05.20211126T0500Z] -job(01) failed, retrying in PT30S (after 2022-02-02T03:53:56Z)
2022-02-02T03:54:02Z INFO - [ukv_um_fcst_05.20211126T0500Z] -submit-num=02, owner@host=gadi.nci.org.au
2022-02-02T03:54:36Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=ready: (internal)submitted at 2022-02-02T03:54:23Z for job(02)
2022-02-02T03:54:36Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: submission timeout=P1D, polling intervals=30*PT5M,...
2022-02-02T03:59:59Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T04:04:59Z))
2022-02-02T04:01:49Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (polled)submitted at 2022-02-02T03:54:23Z for job(02)
2022-02-02T04:01:50Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (received)started at 2022-02-02T04:01:44Z for job(02)
2022-02-02T04:01:50Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: execution timeout=PT24M, polling intervals=2*PT2M,2*PT5M,PT1M,PT2M,PT7M,...
2022-02-02T04:03:50Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT2M (after 2022-02-02T04:05:50Z))
2022-02-02T04:04:05Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (polled)started at 2022-02-02T04:01:44Z for job(02)
2022-02-02T04:06:07Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T04:11:07Z))
2022-02-02T04:06:07Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (received)succeeded at 2022-02-02T04:06:05Z for job(02)
2022-02-02T04:06:27Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=succeeded: (polled)succeeded at 2022-02-02T04:06:05Z for job(02)
The key lines appear to be these
2022-02-02T03:53:10Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (received)succeeded at 2022-02-02T03:53:06Z for job(01)
2022-02-02T03:53:26Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=succeeded: (polled)failed at 2022-02-02T03:53:26Z for job(01)
The job returns a message that it is successful, then polling is run and returns a failed state.
I’m not sure why cylc is polling or why it is returning a failed state. I’m not sure what cylc is polling either. Therefore, I don’t know if there is any way to fix it, or where to look.
I’ve only been noticing this behaviour in the last week, so it could be a result of system issues. The PBS server seems to be overloaded as well lately. There are myriad issues (and often are), but this one I haven’t noticed before.
It’s also common for a job to run, and the suite thinks it is still in the submitted state; I’m not curretnly worried about this because polling would normally resolve it.