I’m trying to modify a Cylc suite to handle the case where a remote job is submitted successfully but ends up in a state where it will never run, while still remaining in the queue on the remote host. This happened to me recently: a job submitted via PBS was moved into the HELD state and stayed there for many days before I spotted it.
The underlying problems are that I did not spot it (the suite normally runs reliably and quietly for long periods) and that the remote host handled the job badly. Still, I want to improve the suite so that it copes with this better.
I added this to [[[events]]] to ensure I get an email notification when a job has been submitted but has not started executing after some time:
mail events = submission timeout
submission timeout = PT12H
I can then kill that job and resubmit it manually via the GUI.
I can’t figure out how to do that automatically. I got as far as writing a submission timeout handler script that kills the task, but I can’t work out how to resubmit it. When the handler script kills the job, the task goes into the submit-failed state, but adding submission retry delays doesn’t result in a resubmit attempt.
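For reference, the combination described above would look roughly like this in suite.rc. This is only a sketch assuming Cylc 7 style sections: the task name, the retry delay values and the handler script name are placeholders, not my real settings.

```
[runtime]
    [[remote_task]]  # placeholder task name
        [[[job]]]
            # placeholder values; these did not trigger a resubmit after the kill
            submission retry delays = 3*PT10M
        [[[events]]]
            mail events = submission timeout
            submission timeout = PT12H
            # placeholder name for the script that kills the stuck job
            submission timeout handler = kill-stuck-job.sh
```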
I may be confused about what “submission” means in different contexts. Presumably submission retries only apply to jobs that could not be submitted, rather than jobs that were submitted but never started?
Can anyone offer any advice here? I may be missing something really obvious.