Succeeded jobs polling as failed

Hi. I’ve recently been hitting an issue where jobs run successfully, but then register as failed and are resubmitted.
I found some similar topics but none that seemed close enough to be helpful.

I’m submitting to PBS. My task communication method is the default, whatever that is (I’m running at NCI); I’m fairly sure it is not “poll”, yet some polling is clearly happening as well.

Here are the relevant lines from job-activity.log, job.status, and the suite log, for tries 1 and 2.

job-activity.log

[jobs-submit ret_code] 0
[jobs-submit out] 2022-02-02T03:40:57Z|20211126T0500Z/ukv_um_fcst_05/01|0|34539416.gadi-pbs
(gadi.nci.org.au) 2022-02-02T03:40:57Z [STDOUT] 34539416.gadi-pbs
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T03:48:16Z|20211126T0500Z/ukv_um_fcst_05/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "34539416.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:41:47Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T03:51:38Z|20211126T0500Z/ukv_um_fcst_05/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "34539416.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:41:47Z", "time_run": "2022-02-02T03:48:39Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T03:53:24Z|20211126T0500Z/ukv_um_fcst_05/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "34539416.gadi-pbs", "batch_sys_exit_polled": 1, "time_submit_exit": "2022-02-02T03:41:47Z", "time_run": "2022-02-02T03:48:39Z"}
[(('job-logs-retrieve', 'succeeded'), 1) ret_code] 0
[(('job-logs-retrieve', 'retry'), 1) ret_code] 0


[jobs-submit ret_code] 0
[jobs-submit out] 2022-02-02T03:54:23Z|20211126T0500Z/ukv_um_fcst_05/02|0|34540021.gadi-pbs
(gadi.nci.org.au) 2022-02-02T03:54:23Z [STDOUT] 34540021.gadi-pbs
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T04:01:37Z|20211126T0500Z/ukv_um_fcst_05/02|{"batch_sys_name": "pbs", "batch_sys_job_id": "34540021.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:54:23Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T04:03:57Z|20211126T0500Z/ukv_um_fcst_05/02|{"batch_sys_name": "pbs", "batch_sys_job_id": "34540021.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:54:23Z", "time_run": "2022-02-02T04:01:44Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T04:06:26Z|20211126T0500Z/ukv_um_fcst_05/02|{"batch_sys_name": "pbs", "batch_sys_job_id": "34540021.gadi-pbs", "run_status": 0, "time_submit_exit": "2022-02-02T03:54:23Z", "time_run": "2022-02-02T04:01:44Z", "time_run_exit": "2022-02-02T04:06:05Z"}
[(('job-logs-retrieve', 'succeeded'), 2) ret_code] 0

job.status

CYLC_BATCH_SYS_NAME=pbs
CYLC_BATCH_SYS_JOB_ID=34539416.gadi-pbs
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-02-02T03:41:47Z
CYLC_JOB_PID=3113620
CYLC_JOB_INIT_TIME=2022-02-02T03:48:39Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-02-02T03:53:06Z
CYLC_BATCH_SYS_EXIT_POLLED=2022-02-02T03:53:24Z

CYLC_BATCH_SYS_NAME=pbs
CYLC_BATCH_SYS_JOB_ID=34540021.gadi-pbs
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-02-02T03:54:23Z
CYLC_JOB_PID=3115075
CYLC_JOB_INIT_TIME=2022-02-02T04:01:44Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-02-02T04:06:05Z

suite log

2022-02-02T03:40:44Z INFO - [ukv_um_fcst_05.20211126T0500Z] -submit-num=01, owner@host=gadi.nci.org.au
2022-02-02T03:41:47Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=ready: (internal)submitted at 2022-02-02T03:40:57Z for job(01)
2022-02-02T03:41:47Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: submission timeout=P1D, polling intervals=30*PT5M,...
2022-02-02T03:46:47Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T03:51:47Z))
2022-02-02T03:48:16Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (polled)submitted at 2022-02-02T03:41:47Z for job(01)
2022-02-02T03:48:41Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (received)started at 2022-02-02T03:48:39Z for job(01)
2022-02-02T03:48:41Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: execution timeout=PT24M, polling intervals=2*PT2M,2*PT5M,PT1M,PT2M,PT7M,...
2022-02-02T03:50:41Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT2M (after 2022-02-02T03:52:41Z))
2022-02-02T03:51:38Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (polled)started at 2022-02-02T03:48:39Z for job(01)
2022-02-02T03:52:55Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T03:57:55Z))
2022-02-02T03:53:10Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (received)succeeded at 2022-02-02T03:53:06Z for job(01)
2022-02-02T03:53:26Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=succeeded: (polled)failed at 2022-02-02T03:53:26Z for job(01)
2022-02-02T03:53:26Z INFO - [ukv_um_fcst_05.20211126T0500Z] -job(01) failed, retrying in PT30S (after 2022-02-02T03:53:56Z)
2022-02-02T03:54:02Z INFO - [ukv_um_fcst_05.20211126T0500Z] -submit-num=02, owner@host=gadi.nci.org.au
2022-02-02T03:54:36Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=ready: (internal)submitted at 2022-02-02T03:54:23Z for job(02)
2022-02-02T03:54:36Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: submission timeout=P1D, polling intervals=30*PT5M,...
2022-02-02T03:59:59Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T04:04:59Z))
2022-02-02T04:01:49Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (polled)submitted at 2022-02-02T03:54:23Z for job(02)
2022-02-02T04:01:50Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (received)started at 2022-02-02T04:01:44Z for job(02)
2022-02-02T04:01:50Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: execution timeout=PT24M, polling intervals=2*PT2M,2*PT5M,PT1M,PT2M,PT7M,...
2022-02-02T04:03:50Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT2M (after 2022-02-02T04:05:50Z))
2022-02-02T04:04:05Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (polled)started at 2022-02-02T04:01:44Z for job(02)
2022-02-02T04:06:07Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T04:11:07Z))
2022-02-02T04:06:07Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (received)succeeded at 2022-02-02T04:06:05Z for job(02)
2022-02-02T04:06:27Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=succeeded: (polled)succeeded at 2022-02-02T04:06:05Z for job(02)

The key lines appear to be these:

2022-02-02T03:53:10Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (received)succeeded at 2022-02-02T03:53:06Z for job(01)
2022-02-02T03:53:26Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=succeeded: (polled)failed at 2022-02-02T03:53:26Z for job(01)

The job reports that it succeeded; a poll then runs and returns a failed state.

I’m not sure why cylc is polling, what it is actually polling, or why the poll returns a failed state. So I don’t know whether there is any way to fix it, or even where to look.

I’ve only been noticing this behaviour in the last week, so it could be the result of system issues; the PBS server has seemed overloaded lately as well. There are myriad issues at the moment (there often are), but this one I haven’t seen before.
It’s also common for a job to run while the suite still thinks it is in the submitted state; I’m not currently worried about that, because polling would normally resolve it.

I can also see cases just now where a task has succeeded, yet cylc poll returns a status of submitted or submit-failed, not just failed.

2022-02-02T05:10:53Z INFO - [ukv_ops_process_airs.20211126T0700Z] -submit-num=01, owner@host=gadi.nci.org.au
2022-02-02T05:11:37Z INFO - [ukv_ops_process_airs.20211126T0700Z] status=ready: (internal)submitted at 2022-02-02T05:11:19Z for job(01)
2022-02-02T05:11:37Z INFO - [ukv_ops_process_airs.20211126T0700Z] -health check settings: submission timeout=P1D, polling intervals=30*PT5M,...
2022-02-02T05:16:46Z INFO - [ukv_ops_process_airs.20211126T0700Z] -poll now, (next in PT5M (after 2022-02-02T05:21:46Z))
2022-02-02T05:17:17Z INFO - [ukv_ops_process_airs.20211126T0700Z] status=submitted: (received)started at 2022-02-02T05:17:06Z for job(01)
2022-02-02T05:17:17Z INFO - [ukv_ops_process_airs.20211126T0700Z] -health check settings: execution timeout=PT20M, polling intervals=2*PT2M,PT5M,2*PT2M,PT7M,...
2022-02-02T05:17:34Z INFO - [ukv_ops_process_airs.20211126T0700Z] status=running: (received)succeeded at 2022-02-02T05:17:34Z for job(01)
2022-02-02T05:21:34Z INFO - [ukv_ops_process_airs.20211126T0700Z] status=succeeded: (polled)submitted at 2022-02-02T05:11:31Z for job(01)

It looks like either the poll happened well after the job returned its succeeded message, or the poll issued at 05:16:46 didn’t return its result until 05:21:34. Either way, that doesn’t explain the failed state for a job that didn’t fail.

I think this is the issue?

If so it was fixed in Cylc 7.9.5 / 7.8.10. Which version are you using?

If PBS is struggling that badly you’re likely to have problems.
You are also adding to the PBS load with the frequent polling (every 5 minutes whilst a job is queued, then more frequently whilst it is running). I suggest you take a look at your polling setup (presumably in the cylc global config).
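For reference, polling intervals like the `30*PT5M` and `2*PT2M,2*PT5M,...` sequences visible in the suite log above can be set centrally. A hypothetical sketch of a Cylc 7 global.rc with less aggressive polling (the host name matches the logs, but the interval values here are placeholders, not a recommendation for your site):

```ini
# Sketch only -- adjust intervals to your site's needs.
[hosts]
    [[gadi.nci.org.au]]
        # Poll less often while queued, since submission can sit a while:
        submission polling intervals = 6*PT10M, PT30M
        # Poll less often while running, easing the qstat load on PBS:
        execution polling intervals = PT5M, 2*PT15M, PT30M
```

The same `submission polling intervals` / `execution polling intervals` settings can also be set per task under `[runtime][TASK][job]` in a suite.rc, which is where per-suite overrides usually end up.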

I saw that one. It didn’t seem related, because it didn’t explain why a job that had already reported back as succeeded would shortly afterwards be polled and return a failed state.

I can understand how, if a job completes before it is ever polled, cylc might think it submit-failed. I can sort of imagine how a job could complete quickly and a poll could return a submitted state in the interim. But I can’t imagine how polling could come up with a failed state. Where would a failed status come from?

Although, given that I’m seeing all of these variations, I suppose we’d rather not imagine that there’s more than one bug at play.

I am using 7.8.3, which is the default on the system; the most recent version we have installed is 7.8.8.

I suppose reducing the polling would reduce the load a bit. On the other hand, since we also sometimes get communication issues between the remote and local hosts, people end up adding extra polling to their suites because they see jobs that appear to be running or submitted but actually finished an hour ago. Damned either way, really.

Yes, PBS has been struggling occasionally, though I don’t know enough about cylc to be sure that could be related to the polling issues. That’s partly why I was asking.

The poll which results in the failed state finds the job started but no longer in the queue; this is treated as a failure, on the assumption that the job died without reporting an exit status. Although that wasn’t the exact error I tested for, I think the slow polling can explain this too.
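To make that inference concrete, here is a hypothetical sketch (not Cylc’s actual code, and the function name is mine) of how a poll can arrive at "failed" for a job that succeeded: if the poll sees the job has started but qstat no longer lists it, and it reads a job.status file that does not yet contain the exit record, the only remaining conclusion is that the job died.

```python
def interpret_poll(started: bool, in_queue: bool, exit_recorded: bool) -> str:
    """Infer a task state from one poll's view of the batch system.

    started       -- job.status shows the job began running
    in_queue      -- qstat still lists the job
    exit_recorded -- job.status contains an exit record (e.g. CYLC_JOB_EXIT)
    """
    if not started:
        # Never began running: still queued, or it vanished before starting.
        return "submitted" if in_queue else "submit-failed"
    if in_queue:
        return "running"
    # Started and gone from the queue: trust job.status if it has an exit;
    # otherwise assume the job died without reporting back.
    return "succeeded" if exit_recorded else "failed"

# Try 1 at 03:53:24: job had started and left the queue, but the poll read
# job.status before the succeeded record was visible on the poll host:
print(interpret_poll(started=True, in_queue=False, exit_recorded=False))  # failed
```

This also accounts for the other variations in the thread: a sufficiently stale view yields "submitted" or "submit-failed" instead, which matches the 05:21:34 poll result above.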

This is also related to https://github.com/cylc/cylc-flow/issues/4513, which questions (amongst other things) why we allow a succeeded task to have its state changed via polling.

Aha! Thanks for explaining that.