Server thinks task pre-requisites not satisfied when they are

...and a new issue here.

The task couple_ma-to-da_q.20140101T1815Z has completed and succeeded:

$ cylc show NIMO couple_ma-to-da_q.20140101T1815Z
(snip)
prerequisites (- => not satisfied):
  + run_model_advance_q.20140101T1800Z succeeded

outputs (- => not completed):
  - couple_ma-to-da_q.20140101T1815Z expired
  + couple_ma-to-da_q.20140101T1815Z submitted
  - couple_ma-to-da_q.20140101T1815Z submit-failed
  + couple_ma-to-da_q.20140101T1815Z started
  + couple_ma-to-da_q.20140101T1815Z succeeded
  - couple_ma-to-da_q.20140101T1815Z failed

The suite server log confirms that the server knows this task has succeeded:

$ tail -7 log/suite/log
2021-07-28T01:41:48Z INFO - [couple_ma-to-da_q.20140101T1815Z] -submit-num=01, owner@host=localhost
2021-07-28T01:41:48Z INFO - [couple_ma-to-da_q.20140101T1815Z] -triggered off ['run_model_advance_q.20140101T1800Z']
2021-07-28T01:41:49Z INFO - [couple_ma-to-da_q.20140101T1815Z] status=ready: (internal)submitted at 2021-07-28T01:41:49Z for job(01)
2021-07-28T01:41:49Z INFO - [couple_ma-to-da_q.20140101T1815Z] -health check settings: submission timeout=None
2021-07-28T01:42:31Z INFO - [couple_ma-to-da_q.20140101T1815Z] status=submitted: (received)started at 2021-07-28T01:42:31Z for job(01)
2021-07-28T01:42:31Z INFO - [couple_ma-to-da_q.20140101T1815Z] -health check settings: execution timeout=PT20M, polling intervals=PT11M,PT2M,PT7M,...
2021-07-28T01:42:38Z INFO - [couple_ma-to-da_q.20140101T1815Z] status=running: (received)succeeded at 2021-07-28T01:42:38Z for job(01)

But then I look at the next task in the processing chain, which directly depends on this task:

$ cylc show NIMO run_da_q.20140101T1815Z
(snip)
prerequisites (- => not satisfied):
  - couple_ma-to-da_q.20140101T1815Z succeeded
  + prep_da_q.20140101T1815Z succeeded
  + configure_da.20140101T0000Z succeeded

outputs (- => not completed):
  - run_da_q.20140101T1815Z expired
  - run_da_q.20140101T1815Z submitted
  - run_da_q.20140101T1815Z submit-failed
  - run_da_q.20140101T1815Z started
  - run_da_q.20140101T1815Z succeeded
  - run_da_q.20140101T1815Z failed

So we’ve established that the success message from couple_ma-to-da_q.20140101T1815Z reached the server and made it into the server log, and “cylc show” reports that the task succeeded. But “cylc show” on run_da_q.20140101T1815Z, which depends on it, still shows the prerequisite “couple_ma-to-da_q.20140101T1815Z succeeded” as not satisfied.

How can this happen?
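
For anyone wanting to check for this kind of mismatch mechanically, here is a minimal Python sketch that compares two pasted `cylc show` outputs. It assumes only the simple `+`/`-` line layout shown in this thread (conditional prerequisite expressions are not handled). The names `parse_show` and `stale_prerequisites` are hypothetical helpers, not part of Cylc.

```python
import re

# Matches entries such as "  + run_model_advance_q.20140101T1800Z succeeded".
# Assumption: one "<mark> <task_id> <state>" triple per line, as in this thread.
_ENTRY = re.compile(r"^\s*([+-])\s+(\S+)\s+(\S+)\s*$")


def parse_show(text):
    """Parse pasted `cylc show` text into prerequisite and output tables.

    Returns {"prerequisites": {...}, "outputs": {...}}, where each inner dict
    maps (task_id, state) to True if the line was marked "+", False if "-".
    """
    sections = {"prerequisites": {}, "outputs": {}}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("prerequisites"):
            current = "prerequisites"
        elif stripped.startswith("outputs"):
            current = "outputs"
        else:
            match = _ENTRY.match(line)
            if match and current:
                mark, task_id, state = match.groups()
                sections[current][(task_id, state)] = (mark == "+")
    return sections


def stale_prerequisites(upstream_show, downstream_show):
    """List prerequisites unsatisfied downstream whose upstream output is complete."""
    upstream = parse_show(upstream_show)["outputs"]
    done = {key for key, ok in upstream.items() if ok}
    prereqs = parse_show(downstream_show)["prerequisites"]
    return sorted(key for key, ok in prereqs.items() if not ok and key in done)


# Abbreviated copies of the two outputs quoted in this thread.
upstream_text = """outputs (- => not completed):
  + couple_ma-to-da_q.20140101T1815Z started
  + couple_ma-to-da_q.20140101T1815Z succeeded
  - couple_ma-to-da_q.20140101T1815Z failed
"""
downstream_text = """prerequisites (- => not satisfied):
  - couple_ma-to-da_q.20140101T1815Z succeeded
  + prep_da_q.20140101T1815Z succeeded
"""
print(stale_prerequisites(upstream_text, downstream_text))
# prints [('couple_ma-to-da_q.20140101T1815Z', 'succeeded')]
```

A non-empty result only means the two snapshots disagree at the moment they were taken. It says nothing about why.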

I’d hope to be able to say that can’t happen, but evidently you have a counterexample!

Can you remind me which Cylc version you’re using? (7.9.3?)

Are there no warnings in the suite log?

And is the problem repeatable?

Did you make changes to the graph using cylc reload or add/remove tasks from the pool using cylc insert or cylc remove?
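
If it helps with the question about warnings, a rough Python sketch for pulling anything above INFO out of the suite log might look like the following. It assumes the `<timestamp> <LEVEL> - <message>` layout seen in the excerpt earlier in this thread; the function name `log_lines_above_info` is made up for illustration and is not a Cylc API.

```python
import re

# Assumed line layout, based on the log excerpt in this thread:
#   2021-07-28T01:41:48Z INFO - [task.cycle] message
_LOG_LINE = re.compile(r"^(\S+)\s+([A-Z]+)\s+-\s+(.*)$")


def log_lines_above_info(lines):
    """Yield (timestamp, level, message) for lines that are not INFO or DEBUG.

    Lines that do not match the assumed layout (e.g. traceback
    continuation lines) are skipped.
    """
    for line in lines:
        match = _LOG_LINE.match(line.rstrip("\n"))
        if match and match.group(2) not in ("INFO", "DEBUG"):
            yield match.groups()


# Example use against the log path shown earlier in the thread:
#
#     with open("log/suite/log") as handle:
#         for timestamp, level, message in log_lines_above_info(handle):
#             print(timestamp, level, message)
```

Because unmatched lines are dropped, a multi-line traceback would show only its first line here, so treat an empty result as a hint and not as proof that nothing went wrong.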

Hi folks. Version is 7.8.3, no warnings in the log, no changes to the suite or the task pool. HOWEVER... shortly after I posted the original message above, commands from the command line on the platform in question started hanging, as did attempts to log into the host (on any login node). They stayed that way until after I went to sleep. This morning, I can see that the suite became unstuck – that “succeeded” prerequisite apparently eventually was seen as satisfied – until a task failed a dozen cycle points later with a PBS time exceeded limit, which should never have happened and which I’ve seen before when the system was experiencing disk issues. So I’m betting that this wasn’t a Cylc issue at all, but an issue with delays or other burps in file system I/O. Does that seem reasonable to you?

That does seem a reasonable non-Cylc explanation, thanks for updating us.

I think it’s highly unlikely that there’s a bug affecting something as fundamental as this, or we’d have seen it before, with tens of millions of jobs being managed by Cylc every month.

Perhaps you should try to get upgraded to 7.9.3 though, as plenty of smaller bugs got fixed between 7.8.3 and 7.9.3.