Succeeded jobs polling as failed

Hi. I’ve recently been hitting an issue where jobs run successfully, but then register as failed and are resubmitted.
I found some similar topics but none that seemed close enough to be helpful.

I’m submitting to PBS. My task communication method is the default, whatever that is (I’m running at NCI); I’m fairly sure it is not “poll”, yet some polling is clearly happening as well.

Here are the relevant lines from job-activity.log, job.status, and the suite log, for tries 1 and 2.

job-activity.log

[jobs-submit ret_code] 0
[jobs-submit out] 2022-02-02T03:40:57Z|20211126T0500Z/ukv_um_fcst_05/01|0|34539416.gadi-pbs
(gadi.nci.org.au) 2022-02-02T03:40:57Z [STDOUT] 34539416.gadi-pbs
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T03:48:16Z|20211126T0500Z/ukv_um_fcst_05/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "34539416.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:41:47Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T03:51:38Z|20211126T0500Z/ukv_um_fcst_05/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "34539416.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:41:47Z", "time_run": "2022-02-02T03:48:39Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T03:53:24Z|20211126T0500Z/ukv_um_fcst_05/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "34539416.gadi-pbs", "batch_sys_exit_polled": 1, "time_submit_exit": "2022-02-02T03:41:47Z", "time_run": "2022-02-02T03:48:39Z"}
[(('job-logs-retrieve', 'succeeded'), 1) ret_code] 0
[(('job-logs-retrieve', 'retry'), 1) ret_code] 0


[jobs-submit ret_code] 0
[jobs-submit out] 2022-02-02T03:54:23Z|20211126T0500Z/ukv_um_fcst_05/02|0|34540021.gadi-pbs
(gadi.nci.org.au) 2022-02-02T03:54:23Z [STDOUT] 34540021.gadi-pbs
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T04:01:37Z|20211126T0500Z/ukv_um_fcst_05/02|{"batch_sys_name": "pbs", "batch_sys_job_id": "34540021.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:54:23Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T04:03:57Z|20211126T0500Z/ukv_um_fcst_05/02|{"batch_sys_name": "pbs", "batch_sys_job_id": "34540021.gadi-pbs", "batch_sys_exit_polled": 0, "time_submit_exit": "2022-02-02T03:54:23Z", "time_run": "2022-02-02T04:01:44Z"}
[jobs-poll ret_code] 0
[jobs-poll out] 2022-02-02T04:06:26Z|20211126T0500Z/ukv_um_fcst_05/02|{"batch_sys_name": "pbs", "batch_sys_job_id": "34540021.gadi-pbs", "run_status": 0, "time_submit_exit": "2022-02-02T03:54:23Z", "time_run": "2022-02-02T04:01:44Z", "time_run_exit": "2022-02-02T04:06:05Z"}
[(('job-logs-retrieve', 'succeeded'), 2) ret_code] 0

job.status

CYLC_BATCH_SYS_NAME=pbs
CYLC_BATCH_SYS_JOB_ID=34539416.gadi-pbs
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-02-02T03:41:47Z
CYLC_JOB_PID=3113620
CYLC_JOB_INIT_TIME=2022-02-02T03:48:39Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-02-02T03:53:06Z
CYLC_BATCH_SYS_EXIT_POLLED=2022-02-02T03:53:24Z

CYLC_BATCH_SYS_NAME=pbs
CYLC_BATCH_SYS_JOB_ID=34540021.gadi-pbs
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2022-02-02T03:54:23Z
CYLC_JOB_PID=3115075
CYLC_JOB_INIT_TIME=2022-02-02T04:01:44Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2022-02-02T04:06:05Z

suite log

2022-02-02T03:40:44Z INFO - [ukv_um_fcst_05.20211126T0500Z] -submit-num=01, owner@host=gadi.nci.org.au
2022-02-02T03:41:47Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=ready: (internal)submitted at 2022-02-02T03:40:57Z for job(01)
2022-02-02T03:41:47Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: submission timeout=P1D, polling intervals=30*PT5M,...
2022-02-02T03:46:47Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T03:51:47Z))
2022-02-02T03:48:16Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (polled)submitted at 2022-02-02T03:41:47Z for job(01)
2022-02-02T03:48:41Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (received)started at 2022-02-02T03:48:39Z for job(01)
2022-02-02T03:48:41Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: execution timeout=PT24M, polling intervals=2*PT2M,2*PT5M,PT1M,PT2M,PT7M,...
2022-02-02T03:50:41Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT2M (after 2022-02-02T03:52:41Z))
2022-02-02T03:51:38Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (polled)started at 2022-02-02T03:48:39Z for job(01)
2022-02-02T03:52:55Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T03:57:55Z))
2022-02-02T03:53:10Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (received)succeeded at 2022-02-02T03:53:06Z for job(01)
2022-02-02T03:53:26Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=succeeded: (polled)failed at 2022-02-02T03:53:26Z for job(01)
2022-02-02T03:53:26Z INFO - [ukv_um_fcst_05.20211126T0500Z] -job(01) failed, retrying in PT30S (after 2022-02-02T03:53:56Z)
2022-02-02T03:54:02Z INFO - [ukv_um_fcst_05.20211126T0500Z] -submit-num=02, owner@host=gadi.nci.org.au
2022-02-02T03:54:36Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=ready: (internal)submitted at 2022-02-02T03:54:23Z for job(02)
2022-02-02T03:54:36Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: submission timeout=P1D, polling intervals=30*PT5M,...
2022-02-02T03:59:59Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T04:04:59Z))
2022-02-02T04:01:49Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (polled)submitted at 2022-02-02T03:54:23Z for job(02)
2022-02-02T04:01:50Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=submitted: (received)started at 2022-02-02T04:01:44Z for job(02)
2022-02-02T04:01:50Z INFO - [ukv_um_fcst_05.20211126T0500Z] -health check settings: execution timeout=PT24M, polling intervals=2*PT2M,2*PT5M,PT1M,PT2M,PT7M,...
2022-02-02T04:03:50Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT2M (after 2022-02-02T04:05:50Z))
2022-02-02T04:04:05Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (polled)started at 2022-02-02T04:01:44Z for job(02)
2022-02-02T04:06:07Z INFO - [ukv_um_fcst_05.20211126T0500Z] -poll now, (next in PT5M (after 2022-02-02T04:11:07Z))
2022-02-02T04:06:07Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (received)succeeded at 2022-02-02T04:06:05Z for job(02)
2022-02-02T04:06:27Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=succeeded: (polled)succeeded at 2022-02-02T04:06:05Z for job(02)

The key lines appear to be these:

2022-02-02T03:53:10Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=running: (received)succeeded at 2022-02-02T03:53:06Z for job(01)
2022-02-02T03:53:26Z INFO - [ukv_um_fcst_05.20211126T0500Z] status=succeeded: (polled)failed at 2022-02-02T03:53:26Z for job(01)

The job reports that it succeeded; a poll then runs and returns a failed state.

I’m not sure why cylc is polling, what it is actually polling, or why the poll returns a failed state. So I don’t know whether there is any way to fix it, or even where to look.

I’ve only been noticing this behaviour in the last week, so it could be the result of system issues; the PBS server has seemed overloaded lately as well. There are myriad issues at the moment (there often are), but this one I haven’t seen before.
It’s also common for a job to run while the suite still thinks it is in the submitted state; I’m not currently worried about that, because polling would normally resolve it.

I can also see cases just now where a task has succeeded, yet cylc poll returns a status of submitted or submit-failed, not just failed.

2022-02-02T05:10:53Z INFO - [ukv_ops_process_airs.20211126T0700Z] -submit-num=01, owner@host=gadi.nci.org.au
2022-02-02T05:11:37Z INFO - [ukv_ops_process_airs.20211126T0700Z] status=ready: (internal)submitted at 2022-02-02T05:11:19Z for job(01)
2022-02-02T05:11:37Z INFO - [ukv_ops_process_airs.20211126T0700Z] -health check settings: submission timeout=P1D, polling intervals=30*PT5M,...
2022-02-02T05:16:46Z INFO - [ukv_ops_process_airs.20211126T0700Z] -poll now, (next in PT5M (after 2022-02-02T05:21:46Z))
2022-02-02T05:17:17Z INFO - [ukv_ops_process_airs.20211126T0700Z] status=submitted: (received)started at 2022-02-02T05:17:06Z for job(01)
2022-02-02T05:17:17Z INFO - [ukv_ops_process_airs.20211126T0700Z] -health check settings: execution timeout=PT20M, polling intervals=2*PT2M,PT5M,2*PT2M,PT7M,...
2022-02-02T05:17:34Z INFO - [ukv_ops_process_airs.20211126T0700Z] status=running: (received)succeeded at 2022-02-02T05:17:34Z for job(01)
2022-02-02T05:21:34Z INFO - [ukv_ops_process_airs.20211126T0700Z] status=succeeded: (polled)submitted at 2022-02-02T05:11:31Z for job(01)

It looks like either the poll happened well after the job returned its succeeded message, or the poll issued at 05:16:46 didn’t return its result until 05:21:34. Either way, that doesn’t explain the failed state for a job that didn’t fail.

I think this is the issue?

If so it was fixed in Cylc 7.9.5 / 7.8.10. Which version are you using?

If PBS is struggling that badly you’re likely to have problems.
You are also adding to the PBS load with the frequent polling (every 5 minutes whilst a job is queued, then more frequently whilst it is running). I suggest you take a look at your polling setup (presumably in the cylc global config).
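For reference, polling intervals like the `30*PT5M` and `2*PT2M,2*PT5M,...` sequences visible in the suite log above can be set centrally. A hypothetical sketch of a Cylc 7 global.rc with less aggressive polling (the host name matches the logs, but the interval values here are placeholders, not a recommendation for your site):

```ini
# Sketch only -- adjust intervals to your site's needs.
[hosts]
    [[gadi.nci.org.au]]
        # Poll less often while queued, since submission can sit a while:
        submission polling intervals = 6*PT10M, PT30M
        # Poll less often while running, easing the qstat load on PBS:
        execution polling intervals = PT5M, 2*PT15M, PT30M
```

The same `submission polling intervals` / `execution polling intervals` settings can also be set per task under `[runtime][TASK][job]` in a suite.rc, which is where per-suite overrides usually end up.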

I saw that one. It didn’t seem related, because it didn’t explain why a job that had already reported back as succeeded would shortly afterwards be polled and return a failed state.

I can understand how, if a job completes before it is ever polled, cylc might think it submit-failed. I can sort of imagine how a job could complete quickly and a poll could return a submitted state in the interim. But I can’t imagine how polling could come up with a failed state. Where would a failed status come from?

Although, given that I’m seeing all of these variations, I suppose we’d rather not imagine that there’s more than one bug at play.

I am using 7.8.3, which is the default on the system; the most recent version we have installed is 7.8.8.

I suppose reducing the polling would reduce the load a bit. On the other hand, since we also sometimes get communication issues between the remote and local hosts, people end up adding extra polling to their suites because they see jobs that appear to be running or submitted but actually finished an hour ago. Damned either way, really.

Yes, PBS has been struggling occasionally, though I don’t know enough about cylc to be sure that could be related to the polling issues. That’s partly why I was asking.

The poll which results in the failed state finds the job started but no longer in the queue; this is treated as a failure, on the assumption that the job died without reporting an exit status. Although that wasn’t the exact error I tested for, I think the slow polling can explain this too.
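To make that inference concrete, here is a hypothetical sketch (not Cylc’s actual code, and the function name is mine) of how a poll can arrive at "failed" for a job that succeeded: if the poll sees the job has started but qstat no longer lists it, and it reads a job.status file that does not yet contain the exit record, the only remaining conclusion is that the job died.

```python
def interpret_poll(started: bool, in_queue: bool, exit_recorded: bool) -> str:
    """Infer a task state from one poll's view of the batch system.

    started       -- job.status shows the job began running
    in_queue      -- qstat still lists the job
    exit_recorded -- job.status contains an exit record (e.g. CYLC_JOB_EXIT)
    """
    if not started:
        # Never began running: still queued, or it vanished before starting.
        return "submitted" if in_queue else "submit-failed"
    if in_queue:
        return "running"
    # Started and gone from the queue: trust job.status if it has an exit;
    # otherwise assume the job died without reporting back.
    return "succeeded" if exit_recorded else "failed"

# Try 1 at 03:53:24: job had started and left the queue, but the poll read
# job.status before the succeeded record was visible on the poll host:
print(interpret_poll(started=True, in_queue=False, exit_recorded=False))  # failed
```

This also accounts for the other variations in the thread: a sufficiently stale view yields "submitted" or "submit-failed" instead, which matches the 05:21:34 poll result above.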

This is also related to https://github.com/cylc/cylc-flow/issues/4513, which questions (amongst other things) why we allow a succeeded task to have its state changed via polling.

Aha! Thanks for explaining that.