Job state poll error after workflow migration

We have a few workflows running in production. This particular one is using Cylc 8.3.3.

We condemned the host the workflow was running on yesterday. The workflow continued to submit jobs for a while and then shut itself down, migrated to another host and seemed fine but it was not.

At least one task, 20241128T0330Z/library_AGCD remained in the GUI as submitted but did not update with success due to what seems to be a bug when parsing the output of a job poll.

In the scheduler/log I can see:

2024-11-28T03:39:40Z INFO - Workflow shutting down - REQUEST(NOW-NOW)
2024-11-28T03:39:40Z WARNING - Orphaned tasks:
    * 20241128T0330Z/library_ADFD (submitted)
    * 20241128T0330Z/library_AGCD (submitted)
    * 20241128T0330Z/feed_GPS (running)
    * 20241128T0330Z/process_radar_wind (running)
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/library_GROUP_B/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/feed_radar_wind/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/feed_ADFD/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/feed_AGCD/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/process_ADFD/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/process_AGCD/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:41Z INFO - platform: prod-dm_simpler_d - remote tidy (on prod-dm.ddns.australis.internal.bom.gov.au)
2024-11-28T03:39:43Z INFO - Attempting to restart on "prod-cylc.$DOMAIN"
2024-11-28T03:40:36Z INFO - Workflow now running on "prod-cylc.$DOMAIN".
2024-11-28T03:40:36Z INFO - DONE

A few seconds later:

2024-11-28T03:40:40Z INFO - + 20241128T0330Z/library_AGCD submitted
2024-11-28T03:40:40Z INFO - [20241128T0330Z/library_AGCD/01:submitted(runahead)]
 => submitted

[...]

2024-11-28T03:40:40Z INFO - + 20241128T0330Z/library_AGCD poll_timer
2024-11-28T03:40:40Z INFO - + 20241128T0330Z/library_AGCD ['try_timers', 'submission-retry']
2024-11-28T03:40:40Z INFO - + 20241128T0330Z/library_AGCD ['try_timers', 'execution-retry']

And then when the workflow attempts to poll it, the poll output is “unhandled”:

2024-11-28T03:40:42Z INFO - [20241128T0330Z/library_AGCD/01:submitted] (polled)succeeded
2024-11-28T03:40:42Z INFO - [20241128T0330Z/library_AGCD/01:submitted] setting implied output: submitted
2024-11-28T03:40:42Z INFO - [20241128T0330Z/library_AGCD/01:submitted] submitted to prod-dm_simpler_d:None[None]
2024-11-28T03:40:42Z WARNING - Unhandled jobs-poll output: 2024-11-28T03:40:42Z|20241128T0330Z/library_AGCD/01|{"job_runner_name": "pbs", "job_id": "4292375", "run_status": 0, "time_submit_exit": "2024-11-28T03:39:37Z", "time_run": "2024-11-28T03:39:40Z", "time_run_exit": "2024-11-28T03:39:46Z"}
2024-11-28T03:40:42Z WARNING - list index out of range

The job-activity.log shows the same poll result but no awareness that the job succeeded.

[jobs-submit ret_code] 0
[jobs-submit out] 2024-11-28T03:39:37Z|20241128T0330Z/library_AGCD/01|0|4292375
2024-11-28T03:39:37Z [STDOUT] 4292375.pbs-edc
[jobs-poll ret_code] 0
[jobs-poll out] 2024-11-28T03:40:42Z|20241128T0330Z/library_AGCD/01|{"job_runner_name": "pbs", "job_id": "4292375", "run_status": 0, "time_submit_exit": "2024-11-28T03:39:37Z", "time_run": "2024-11-28T03:39:40Z", "time_run_exit": "2024-11-28T03:39:46Z"}

These are the jobs in Cylc UI Server 1.5.0.

Recovering from this is proving harder than I anticipated. I have attempted to use the default ‘set’. I have attempted to explicitly set succeeded, I have tried to trigger the task. None of these have changed the interface nor allowed dependent tasks to progress.

2024-11-29T04:37:49Z INFO - Command "set" received from jacric-c1op. ID=0e6a251f-3c03-4beb-a75b-a25323b2a59a
    set(flow=['all'], flow_wait=False, outputs=[], prerequisites=[], tasks=['20241128T0330Z/library_ADFD'])
2024-11-29T04:37:49Z INFO - Command "set" actioned. ID=0e6a251f-3c03-4beb-a75b-a25323b2a59a
2024-11-29T04:37:51Z INFO - Command "set" received from jacric-c1op. ID=9e4157d3-3478-415d-be83-05c319a7170d
    set(flow=['all'], flow_wait=False, outputs=[], prerequisites=[], tasks=['20241128T0330Z/library_AGCD'])
2024-11-29T04:37:52Z INFO - Command "set" actioned. ID=9e4157d3-3478-415d-be83-05c319a7170d
2024-11-29T04:38:23Z INFO - Command "set" received from jacric-c1op. ID=d5733039-26c4-44bd-8d56-4342e6f06be3
    set(flow=['all'], flow_wait=False, outputs=['succeeded'], prerequisites=[], tasks=['20241128T0330Z/library_ADFD'])
2024-11-29T04:38:23Z INFO - output 20241128T0330Z/library_ADFD:succeeded completed already
2024-11-29T04:38:23Z INFO - Command "set" actioned. ID=d5733039-26c4-44bd-8d56-4342e6f06be3
2024-11-29T04:39:30Z INFO - Command "force_trigger_tasks" received from jacric-c1op. ID=e2a7b0bc-1c93-4c99-a646-35ff1b93cfc3
    force_trigger_tasks(flow=['all'], flow_wait=False, tasks=['20241128T0330Z/library_ADFD'])
2024-11-29T04:39:30Z WARNING - [20241128T0330Z/library_ADFD/01:submitted] ignoring trigger - already active
2024-11-29T04:39:30Z INFO - Command "force_trigger_tasks" actioned. ID=e2a7b0bc-1c93-4c99-a646-35ff1b93cfc3

Fortunately with only 2 failed jobs, I am able to just trigger all dependent tasks explicitly and then “removed” the problem jobs.

I think this is fixed in 8.3.6: polling/set: submitted task cannot be set to succeeded · Issue #6314 · cylc/cylc-flow · GitHub

1 Like