We have a few workflows running in production. This particular one is using Cylc 8.3.3.
We condemned the host the workflow was running on yesterday. The workflow continued to submit jobs for a while and then shut itself down, migrated to another host and seemed fine but it was not.
At least one task, 20241128T0330Z/library_AGCD
remained in the GUI as submitted
but did not update with success due to what seems to be a bug when parsing the output of a job poll.
In the scheduler/log I can see:
2024-11-28T03:39:40Z INFO - Workflow shutting down - REQUEST(NOW-NOW)
2024-11-28T03:39:40Z WARNING - Orphaned tasks:
* 20241128T0330Z/library_ADFD (submitted)
* 20241128T0330Z/library_AGCD (submitted)
* 20241128T0330Z/feed_GPS (running)
* 20241128T0330Z/process_radar_wind (running)
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/library_GROUP_B/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/feed_radar_wind/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/feed_ADFD/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/feed_AGCD/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/process_ADFD/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:40Z WARNING - 20241128T0330Z/process_AGCD/01: incomplete task event handler ('job-logs-retrieve', 'succeeded')
2024-11-28T03:39:41Z INFO - platform: prod-dm_simpler_d - remote tidy (on prod-dm.ddns.australis.internal.bom.gov.au)
2024-11-28T03:39:43Z INFO - Attempting to restart on "prod-cylc.$DOMAIN"
2024-11-28T03:40:36Z INFO - Workflow now running on "prod-cylc.$DOMAIN".
2024-11-28T03:40:36Z INFO - DONE
A few seconds later:
2024-11-28T03:40:40Z INFO - + 20241128T0330Z/library_AGCD submitted
2024-11-28T03:40:40Z INFO - [20241128T0330Z/library_AGCD/01:submitted(runahead)]
=> submitted
[...]
2024-11-28T03:40:40Z INFO - + 20241128T0330Z/library_AGCD poll_timer
2024-11-28T03:40:40Z INFO - + 20241128T0330Z/library_AGCD ['try_timers', 'submission-retry']
2024-11-28T03:40:40Z INFO - + 20241128T0330Z/library_AGCD ['try_timers', 'execution-retry']
And then when the workflow attempts to poll it, the poll output is “unhandled”:
2024-11-28T03:40:42Z INFO - [20241128T0330Z/library_AGCD/01:submitted] (polled)succeeded
2024-11-28T03:40:42Z INFO - [20241128T0330Z/library_AGCD/01:submitted] setting implied output: submitted
2024-11-28T03:40:42Z INFO - [20241128T0330Z/library_AGCD/01:submitted] submitted to prod-dm_simpler_d:None[None]
2024-11-28T03:40:42Z WARNING - Unhandled jobs-poll output: 2024-11-28T03:40:42Z|20241128T0330Z/library_AGCD/01|{"job_runner_name": "pbs", "job_id": "4292375", "run_status": 0, "time_submit_exit": "2024-11-28T03:39:37Z", "time_run": "2024-11-28T03:39:40Z", "time_run_exit": "2024-11-28T03:39:46Z"}
2024-11-28T03:40:42Z WARNING - list index out of range
The job-activity.log shows the same poll result but no awareness that the job succeeded.
[jobs-submit ret_code] 0
[jobs-submit out] 2024-11-28T03:39:37Z|20241128T0330Z/library_AGCD/01|0|4292375
2024-11-28T03:39:37Z [STDOUT] 4292375.pbs-edc
[jobs-poll ret_code] 0
[jobs-poll out] 2024-11-28T03:40:42Z|20241128T0330Z/library_AGCD/01|{"job_runner_name": "pbs", "job_id": "4292375", "run_status": 0, "time_submit_exit": "2024-11-28T03:39:37Z", "time_run": "2024-11-28T03:39:40Z", "time_run_exit": "2024-11-28T03:39:46Z"}