Hi Hilary, thanks for clarifying.
I included cylc workflow-state because I was copying the approach from the thread "Killing and resubmitting jobs that will never run" (Cylc / Cylc Support - Cylc Workflow Engine). I think it forces a wait until the task status is resolved? Otherwise the release command might be issued before the task has been killed and become held, if that can still happen in Cylc 8?
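To make the ordering I mean concrete, here is a rough Python sketch of the idea rather than my actual handler. The callables `hold`, `kill`, `get_status` and `release` are hypothetical stand-ins for whatever wraps the real cylc commands, and the `wanted` status string is an assumption too, since I am not sure what status Cylc 8 reports once the kill has gone through.

```python
import time
from typing import Callable


def wait_for_status(get_status: Callable[[], str], wanted: str,
                    timeout: float = 60.0, interval: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_status() until it returns `wanted` or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def kill_then_release(hold: Callable[[], None], kill: Callable[[], None],
                      get_status: Callable[[], str],
                      release: Callable[[], None], wanted: str,
                      timeout: float = 60.0, interval: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Hold and kill a task, then release it only once its status has settled.

    All four callables are hypothetical wrappers (assumptions, not Cylc API).
    Returns False, without releasing, if the status never reaches `wanted`.
    """
    hold()
    kill()
    if not wait_for_status(get_status, wanted, timeout, interval, sleep):
        return False
    release()
    return True
```

The point is only that the release is skipped if the wait times out, so it can never race ahead of the kill.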
I haven’t got any more information about why the job kill failed. In fact, since we had a few system issues over the weekend, a lot of my jobs hit the submission timeout and were killed successfully (good timing to test my script?!), so I haven’t even had a repeat of the kill failure. Maybe I was unlucky in that the very first time it ran in a “real” workflow, there was some problem that caused the kill failure.
Last week I was using a test suite, but even with a PT1S submission timeout, my jobs would usually start running before the kill command was issued, so I guess it just killed a running job instead of a queued one. That did seem to work, so I transferred the handler to my real workflow, where I promptly got the failure above. The only difference I can think of is that the kill failure was for a job in a held state in PBS, not queued or running, though qdel should still work in that case.