Killing and resubmitting jobs that will never run

I’m trying to modify a cylc suite to handle the situation where a remote job is submitted (successfully) but gets into a state such that it will never run (but remains in the queue on the remote host). I had this recently where a job submitted via PBS got moved into the HELD state and stayed like that for many days before I spotted it.

The underlying problems are me not spotting the issue (though it is a suite that runs reliably and quietly for long periods) and the remote host handling that job badly. But I want to improve the suite to cope with this better.

I added this to [[[events]]] to ensure I get email notification when a job has been submitted but not started executing after some time:

mail events = submission timeout
submission timeout = PT12H
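For context, in a Cylc 7 suite.rc these settings live under a task's [[[events]]] section; something like the following, where my_task is a placeholder name:

```ini
[runtime]
    [[my_task]]
        [[[events]]]
            # Email me if the job has not started executing PT12H after submission.
            mail events = submission timeout
            submission timeout = PT12H
```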

Then I can kill that job manually via the GUI and resubmit it.

I can’t figure out how to do that automatically. I got as far as writing a submission timeout handler script to kill the task, but I can’t work out how to resubmit it. When the handler script kills the job it goes into the submit-failed state, and adding submission retry delays doesn’t result in a resubmit attempt.

I may be getting confused about what “submission” means in different contexts - presumably submission retries only apply to jobs that could not be submitted, rather than jobs that were submitted but didn’t start?

Can anyone offer any advice here? I may be missing something really obvious.

Thanks,

Adam.


I think the problem is that killing the task puts it into the held state, see https://cylc.github.io/doc/built-sphinx/running-suites.html#automatic-task-retry-on-failure

Does anyone know whether issuing a cylc release command immediately after the kill command would be safe or do you need to wait for the kill command to complete?

I wonder whether we need an option to the kill command which prevents the task being held?

Hi @adammaycock,

@dpmatthews is probably right, although I see the user guide only refers to killing a currently-executing job in this context. Did you notice whether the killed (i.e. cancelled) job failed to retry because it went to the held state?

I think you’d need to wait. cylc kill initiates an asynchronous operation to go out to the job host and literally kill the job, and the task status in the Cylc server program only changes once the kill result comes back. cylc release, on the other hand, operates immediately on the current task status in the server.

Sounds like we do! I’ll put an issue up to record this. In the meantime @adammaycock if you can’t make do with timeout emails followed by manual intervention, a timeout handler should work fine so long as you take the aforementioned kill delay into account. Your event handler script should use the CLI to kill the job, then check-and-wait until its status changes to held, then release it.
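The steps above could be sketched roughly as follows. This is only a sketch, assuming the Cylc 7 CLI and the default event-handler argument order (event, suite, task ID, message); the poll_until helper and all variable names are my own invention, not Cylc API:

```shell
#!/usr/bin/env bash
# Sketch of a submission-timeout handler: kill the stuck job, wait for the
# task to reach the held state, then release it so retries can proceed.
set -eu

# Run a command repeatedly until it succeeds: up to $1 attempts, $2 seconds apart.
poll_until() {
    local max_polls="$1" interval="$2"
    shift 2
    local i
    for (( i = 0; i < max_polls; i++ )); do
        if "$@"; then
            return 0
        fi
        sleep "$interval"
    done
    return 1
}

handle_submission_timeout() {
    local suite="$1" task_id="$2"
    # Cylc 7 task IDs look like "name.point" (assumes no dots in the name).
    local name="${task_id%.*}" point="${task_id#*.}"
    # Kill the queued-but-stuck job; this is asynchronous on the server side.
    cylc kill "$suite" "$task_id"
    # Wait (up to 10 minutes here) for the killed task to go to the held state.
    poll_until 10 60 cylc suite-state "$suite" \
        --task="$name" --point="$point" --status=held --max-polls=1
    # Release the held task so its submission retry can go ahead.
    cylc release "$suite" "$task_id"
}
```

The poll_until wrapper is just one way to do the check-and-wait; as noted later in this thread, cylc suite-state can also do the polling itself via its own --interval and --max-polls options.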

Hilary

Thanks @dpmatthews and @hilary.j.oliver, I’ve been able to achieve what I want.

I’ve set submission timeout = PT12H (and added submission timeout to mail events) to ensure I get a notification of stuck jobs, and created a simple submission timeout handler script that does a

cylc kill
cylc suite-state
cylc release

The cylc suite-state call takes care of the kill delay. All works as intended in my testing (using an unrealistically small submission timeout delay).


The cylc suite-state takes care of the kill delay

FYI the cylc suite-state command can be used to poll, e.g. something like this:

# Try 10 times, once a minute; succeed if the condition is met.
cylc suite-state "${SUITE_NAME}" \
    -t "${CYLC_TASK_NAME}" \
    -p "${CYLC_TASK_CYCLE_POINT}" \
    -s 'submit-failed'

Thanks @oliver.sanders, that’s what I’m doing I think. I’m using this after cylc kill

cylc suite-state --task=$name --point=$point --interval=60 --status=held $suite

to check for the killed job to go into the held state before releasing it again. i.e. cylc suite-state takes care of the kill delay for me, I don’t have to code my own checking and waiting.

That delay can be quite long, actually; when I tried shorter --interval values (e.g. 10s) the killed job sometimes didn’t reach the held state within 100s (10 polls of 10s).


Good solution @adammaycock - better than what I suggested :+1: [update: actually I see I didn’t really make an explicit suggestion of which commands to use :grimacing:]

(And pleased to hear it’s working!)

Hilary