Job succeeded but a prior poll came back saying it was submitted - can no longer poll task

I have a job that is stuck in a submitted state. When I poll it via the WUI, this is what shows up in the scheduler log:

2025-03-26T23:03:14Z INFO - Command "poll_tasks" received from user. ID=5d278936-25e8-451a-b78e-453bb06737e5
    poll_tasks(tasks=['20250326T1900Z/ph_ops_process_aircraftsondesurface'])
2025-03-26T23:03:14Z WARNING - No active tasks matching: 20250326T1900Z/ph_ops_process_aircraftsondesurface
2025-03-26T23:03:14Z INFO - Command "poll_tasks" actioned with 1 warnings. ID=5d278936-25e8-451a-b78e-453bb06737e5

But it is definitely still listed as submitted, so is the submitted state not considered active?

2025-03-26T21:07:54Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface:waiting(runahead)] => waiting
2025-03-26T22:04:49Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface:waiting] => waiting(queued)
2025-03-26T22:04:51Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface:waiting(queued)] => waiting
2025-03-26T22:04:51Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface:waiting] => preparing
2025-03-26T22:04:55Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface/01:preparing] submitted to platform:pbs[id]
2025-03-26T22:04:55Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface/01:preparing] => submitted
2025-03-26T22:06:56Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface/01:submitted] poll now, (next in PT5M (after 2025-03-26T22:11:56Z))
2025-03-26T22:11:56Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface/01:submitted] poll now, (next in PT5M (after 2025-03-26T22:16:56Z))
2025-03-26T22:14:31Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface/01:submitted] (polled)submitted
2025-03-26T22:14:31Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface/01:submitted] submitted to platform:pbs[id]
2025-03-26T22:14:34Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface/01:submitted] => running
2025-03-26T22:16:01Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface/01:running] => succeeded
2025-03-26T22:16:47Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface/01:succeeded] (polled)submitted
2025-03-26T22:16:47Z INFO - [20250326T1900Z/ph_ops_process_aircraftsondesurface/01:succeeded] submitted to platform:pbs[id]
    poll_tasks(tasks=['20250326T1900Z/ph_ops_process_aircraftsondesurface'])
2025-03-26T23:03:14Z WARNING - No active tasks matching: 20250326T1900Z/ph_ops_process_aircraftsondesurface

How do I get this to progress (other than setting it to succeeded, given that it has already run and passed)?

The submitted state is definitely “active” and can certainly be polled.
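
For what it's worth, you can also poll it directly from the command line rather than the WUI (the workflow name below is just a placeholder; the task ID matches your log):

    cylc poll <workflow>//20250326T1900Z/ph_ops_process_aircraftsondesurface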

The task is in fact being polled several times via configured automatic polling:

2025-03-26T22:11:56Z INFO - [20250326T1900Z/ph...ce/01:submitted] poll now, (next in PT5M (after 2025-03-26T22:16:56Z))
2025-03-26T22:14:31Z INFO - [20250326T1900Z/ph...ce/01:submitted] (polled)submitted
2025-03-26T22:14:31Z INFO - [20250326T1900Z/ph...ce/01:submitted] submitted to platform:pbs[id]

The last two lines there are the poll result.
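
That automatic polling comes from polling-interval settings, either in the workflow's flow.cylc or from platform defaults in global.cylc. As an illustrative sketch only (the PT5M value just matches what your log shows; your actual configuration may differ or live under the platform settings instead):

    [runtime]
        [[ph_ops_process_aircraftsondesurface]]
            submission polling intervals = PT5M
            execution polling intervals = PT5M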

I suppose it’s possible that the UI datastore has incorrect information … but I’m not sure I understand your log.

Have you redacted the job IDs? If so, does the same job ID show up in every instance of the string submitted to platform:pbs[id] for that task?

It looks as if a (delayed?) poll result returns submitted almost a full minute after the task succeeds:

2025-03-26T22:14:34Z INFO - [20250326T1900Z/ph...ce/01:submitted] => running
2025-03-26T22:16:01Z INFO - [20250326T1900Z/ph...ce/01:running] => succeeded
2025-03-26T22:16:47Z INFO - [20250326T1900Z/ph...ce/01:succeeded] (polled)submitted

So (speculating) maybe this is a job polling race condition: the task succeeds and is removed from the n=0 window by the scheduler (hence it can no longer be polled); then a delayed poll result comes in that wants to return it to the submitted state. The core scheduler correctly ignores that (i.e., it handles the race condition), but somehow it gets into the UI datastore anyway. I’ll see whether that seems plausible in the code …

Setting the succeeded output should work, I think.

That would be the right thing to do if the task still existed in the n=0 window but something had blocked the started and succeeded job status messages.

In your case (if my speculation is correct, which is by no means assured) the task is no longer present in the n=0 window, but you can still set its outputs, which will result in downstream activity if that didn’t already happen when the task succeeded just before the bad/delayed poll result arrived.
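
If you do go that route, the command would be something like this (again, the workflow name is a placeholder):

    cylc set --out=succeeded <workflow>//20250326T1900Z/ph_ops_process_aircraftsondesurface

With no --out option, cylc set completes the task's required outputs, which should amount to the same thing here.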

What version of Cylc are you using?

@TomC - my speculation seems to be correct - I’ve managed to reproduce the problem by deliberately delaying delivery of poll results.

In which case, thanks for the bug report!

For the record, although it sounds like you’ve already reproduced the bug yourself, this was seen (first discovered, at least) on Cylc 8.3.6.

As a side note, there were PBS server issues around this time, so that may explain what produced the result above.

EDIT: I have updated the title to better reflect the bug you’ve described.


Bug ticket, if you want to follow it:

Delayed poll result contaminates the datastore. · Issue #6686 · cylc/cylc-flow · GitHub

@TomC - the apparent submitted task is only a phantom - it exists only in the UI datastore in the n=1 window, not in the scheduler’s active window (n=0), so it won’t cause the workflow to stall; it will disappear from view once things move on a bit. (However, it still needs fixing!)

Interesting. We noticed this only because a task in the same family did manage to run and end in a failed state, holding the workflow up overnight. So if that task had not failed, it sounds like we would never have seen it.
