Cylc kill failing within event handler

I am having issues with an event handler. It is set up as

        [[[events]]]
           submission timeout = PT20M
           handler events = submission timeout
           handlers= """
cylc kill %(workflow)s//%(id)s
cylc workflow-state %(workflow)s -t %(name)s -T
cylc release %(workflow)s//%(id)s
"""

The cylc kill command is failing with

2024-04-05T08:22:51Z INFO - [command] kill_tasks
2024-04-05T08:22:51Z INFO - [20240221T0700Z/aa_um_fcst_07 submitted job:01 flows:1] => submitted(held)
2024-04-05T08:22:51Z INFO - Command actioned: kill_tasks(['20240221T0700Z/aa_um_fcst_07'])
2024-04-05T08:22:54Z ERROR - [jobs-kill cmd] ssh -oBatchMode=yes -oConnectTimeout=10 <remote host> env CYLC_VERSION=8.2.3 CYLC_ENV_NAME=cylc-aps4_a bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-kill -- '$HOME/cylc-run/cx959_spin/run1/log/job' 20240221T0700Z/aa_um_fcst_07/01
    [jobs-kill ret_code] 168
    [jobs-kill out] 2024-04-05T08:22:53Z|20240221T0700Z/aa_um_fcst_07/01|168
2024-04-05T08:22:54Z WARNING - [20240221T0700Z/aa_um_fcst_07 submitted(held) job:01 flows:1] job kill failed

I redacted the hostname for brevity.

The cylc kill command works fine from the command line of the workflow host, but not within the event handler. Internally, it should run a PBS qdel on the job id, so I assume it is having issues getting to that point in cylc jobs-kill.
Any idea what is wrong, or how to investigate? Especially since I have little control over when I will get a task that has a submission timeout.

Since yesterday, I found a case where it looks like the job was killed. So not sure why it failed previously.
However, job-activity.log shows an error running the event handler

[(('event-handler-00', 'submission timeout'), 2) cmd]
cylc kill cx959_spin/run1//20240222T0000Z/aa_surf_ascat_ekf
cylc workflow-state cx959_spin/run1 -t aa_surf_ascat_ekf -T
cylc release cx959_spin/run1//20240222T0000Z/aa_surf_ascat_ekf
[(('event-handler-00', 'submission timeout'), 2) ret_code] 0
[(('event-handler-00', 'submission timeout'), 2) out]
Done
Done
[(('event-handler-00', 'submission timeout'), 2) err] InputError: CYLC_TASK_CYCLE_POINT is not defined

I think this means the -T in cylc workflow-state is not working. I’m not sure if $CYLC_TASK_CYCLE_POINT is supposed to be available for the event handlers; I assume cylc has defined it inside the running task, but not in the case where the job is not yet running?

Yeah, $CYLC_TASK_CYCLE_POINT is a job environment variable. Task event handlers are run by the scheduler in response to task events, not by task jobs. The scheduler environment does not know the task cycle point because there is no universal cycle point - it is task-specific.

You can however pass the cycle point of the event-generating task to the event handler command line with string templating:

cylc workflow-state %(workflow)s -t %(name)s -p %(point)s
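
To see what that template turns into, here is a minimal Python sketch. It assumes the handler string is expanded with Python-style %-formatting against a mapping of event fields. The dictionary and variable names are illustrative only, and the values are copied from the job-activity.log excerpt earlier in the thread.

```python
# Sketch only: assumes handler templates are expanded with Python
# %-style formatting. This is not the scheduler's actual code.
template = "cylc workflow-state %(workflow)s -t %(name)s -p %(point)s"

# Illustrative values, taken from the job-activity.log excerpt.
event_fields = {
    "workflow": "cx959_spin/run1",
    "name": "aa_surf_ascat_ekf",
    "point": "20240222T0000Z",
}

command = template % event_fields
print(command)
```

Because the cycle point is substituted before the command runs, the handler no longer depends on $CYLC_TASK_CYCLE_POINT being set in the scheduler environment.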

Do you not get any more information about why the job kill failed, in debug mode?

Note you could test this by setting a very short job submission timeout.

I’m curious why you need an event handler like this? It seems surprising that a cancelled and resubmitted job would spend any less time in the batch queue than the original.

Also, what’s the purpose of the cylc workflow-state command in the event handler? With those options it will just print the current state of the task.

And finally, note that cylc kill followed by an immediate cylc release (to release the post-kill hold on the task) might not work as you expect: the scheduler has to run an asynchronous process that kills the job on the job platform, which can take a little time, so your release command might get actioned before the kill (and thus the hold) actually happens.

Hi Hilary, thanks for clarifying.

The point of cylc workflow-state is that I was copying the earlier Cylc Support forum post “Killing and resubmitting jobs that will never run”. I think it forces the workflow to wait until the task status is resolved? Otherwise the release command might occur before the task is killed and becomes held, if this still occurs in Cylc 8?

I haven’t got any more information about why the job kill failed. In fact, since we had a few system issues over the weekend, I had a lot of jobs hit the submission timeout and get successfully killed (good timing to test my script?!) so I haven’t even had a repeat of the kill failure. Maybe I was unlucky in that the very first time it ran in a “real” workflow, there was some problem that caused the kill failure.

Last week I was using a test suite, but even with a PT1S submission timeout, my jobs would usually begin before the kill command was issued - although I guess it just killed a running job instead of a queued job. That did seem to work, so I transferred the handler to my real workflow, where I promptly got the failure above. The only difference I could think of is that the kill failure was for a job in a held state in PBS, not queued or running. Though qdel should still work in that case.

Ah OK, that old post was missing a detail (I’ve just updated it). You need to specify the desired task state and polling options (--interval and --max-polls) to get the command to wait for the state to be achieved.
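
The general idea behind an interval plus a maximum number of polls can be sketched in Python as below. This is a generic illustration, not how cylc workflow-state is implemented: poll_until, its parameters, and the canned list of states are all hypothetical stand-ins.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool], interval: float, max_polls: int) -> bool:
    """Return True as soon as check() passes, or False after max_polls tries."""
    for attempt in range(max_polls):
        if check():
            return True
        # Only sleep if another attempt is still to come.
        if attempt < max_polls - 1:
            time.sleep(interval)
    return False


# Hypothetical stand-in for "has the task reached the desired state yet?"
observed_states = iter(["submitted", "submitted", "held"])
reached = poll_until(lambda: next(observed_states) == "held",
                     interval=0.01, max_polls=5)
print(reached)
```

If the desired state is not seen within max_polls attempts the helper gives up and returns False, so a caller can decide not to run the follow-up command (here, the release).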

“System issues” can definitely cause job kill (and much else) to fail. The best Cylc can do in that situation is report that the command failed. If there’s anything wrong in terms of Cylc internals, workflow or job platform configuration, it should fail every time. So hopefully your event handler is good to go with the right workflow-state polling options.