Kill task cylc 8.6.2

Good afternoon,

We recently updated our Cylc version to 8.6.2, and we noticed a change.

Previously, when you killed a task from the tui that was running on a platform with

job runner = slurm

(so, basically a job running on compute nodes), the tui waited for the queue manager to have completely killed the job before showing the task as failed.

Basically, while the job status was “COMPLETING”, the tui did not show the task as failed (red) but as running (blue), and you could not retrigger it. I considered this the right behavior: it prevented us from launching a new run that could potentially start before the previous one ended, trying to access files still “owned” by the previous run and causing “permission denied” failures.

Now that we have updated Cylc, this no longer seems to be true: as soon as I hit “kill”, the task is shown as failed. (On our cluster, SLURM can take up to 5 min to scancel big jobs … :melting_face:)

Can you reproduce the same behavior? Is it a bug?

All the best,
Stella

Hi @sparonuz

I don’t think we’ve made any changes that would affect this behavior in recent Cylc versions.

If you kill a task in Cylc, it only goes to “failed” status once the job failed message is received from the job itself (assuming you are using TCP job comms, not polling?), if the kill succeeds. [Update: wrong, see below].

I think the Slurm “COMPLETING” status occurs after job execution finishes, and it exists for post-execution Slurm job tidy-up reasons and/or Slurm job epilog scripts.

So if Slurm is running slowly for some reason, it seems likely that jobs will remain as COMPLETING (in Slurm) for a while after the task is recorded as failed by Cylc.

I’m not sure if that has implications for retriggering the job quickly or not. I don’t have access to Slurm, but perhaps others can comment.

Hilary

In our case we cannot use TCP, so we’re stuck with polling. I noticed, however, that in the previous version (8.6.0) when you killed or stopped a job it would take a while for the icon to turn red, whereas now it’s practically instantaneous.

Hi @hilary.j.oliver ,

Thanks for the quick reply!

From the SLURM documentation:

When a job is terminating, both the job and its nodes enter the COMPLETING state.[…] When every node allocated to a job has determined that all processes associated with it have terminated, the job changes state to COMPLETED or some other appropriate state (e.g. FAILED). Normally, this happens within a second. However, if the job has processes that cannot be terminated with a SIGKILL signal, the job and one or more nodes can remain in the COMPLETING state for an extended period of time. This may be indicative of processes hung waiting for a core file to complete I/O or operating system failure.

So I would expect (maybe I am wrong) that until the status is COMPLETED (or FAILED), Cylc would not mark the task as finished (be it succeeded or not). Could it be a problem on our system, with Cylc receiving the “kill succeeded” message before the task completes (while still COMPLETING)? How is this message sent to Cylc?

Thanks for the help,
Stella

Ok, I took a look into Cylc’s code, and I think I have a clearer picture now.

This (inside job_runner_mgr.py) must be the code executed by Cylc when hitting “kill” on a SLURM task:

if hasattr(job_runner, "KILL_CMD_TMPL"):
    for line in st_file:
        if not line.startswith(f"{self.CYLC_JOB_ID}="):
            continue
        job_id = line.strip().split("=", 1)[1]
        command = shlex.split(
            job_runner.KILL_CMD_TMPL % {"job_id": job_id})
        try:
            proc = procopen(command, stdindevnull=True,
                            stderrpipe=True)
        except OSError as exc:
            [...]
        else:
            return (
                proc.wait(),
                proc.communicate()[1].decode()
            )

So, in SLURM’s case, there is a Popen call that runs the scancel <job_id> command on our cluster. Once scancel returns (a few seconds), the call returns (wait() terminates) and

CYLC_JOB_EXIT=TERM

is written into job.status file. And this is how Cylc determines that the job failed?
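For illustration, the template expansion step in that snippet works like this (the exact KILL_CMD_TMPL string Cylc uses for SLURM is an assumption here, and the job ID is a placeholder):

```python
import shlex

# Hypothetical kill-command template in the style of the code above;
# the real template string lives in Cylc's SLURM job runner module.
KILL_CMD_TMPL = "scancel '%(job_id)s'"

# Substitute the job ID and split into an argument list for Popen.
command = shlex.split(KILL_CMD_TMPL % {"job_id": "12345"})
print(command)  # ['scancel', '12345']
```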

If I am right, then … there is no way this was working differently in older Cylc versions, and I probably have a false memory (or the polling was just slower in older versions).

I am still wondering if this is the right behavior (probably yes).

Many thanks again,
Stella

Yes, my speculation above about the Slurm COMPLETING state might be more appropriate to normal job success/failure than to job kill via scancel.

For normal (job-internal) succeed/fail, the Cylc job wrapper updates the job.status file (and in the TCP messaging case sends a succeeded/failed message back) just before exiting - after which Slurm will presumably detect that the job finished and might briefly go into the COMPLETING state for cleanup (or less briefly if there are other hung processes?)

If there is a kill request (scancel), then conceivably Slurm could change the job status to COMPLETING just before attempting to kill it (although I don’t know if it does that).

Either way, it seems I may have been wrong above about what happens at the Cylc end. We need to investigate a bit more…


Just following up on this @sparonuz -

I was wrong above - in fact a killed task gets set to “failed” status as soon as the job kill command returns success.

The reason for this is probably that waiting for confirmation of job kill would likely cause confusion (I killed it, but nothing happened!) and hence erroneous attempts to re-kill the task.

This might entail a tiny risk that the real job does not actually get killed, even though the kill command succeeded, and hence a misleading task state, but I don’t think we’ve ever encountered such a problem in the wild.

The Slurm COMPLETING state is interesting though. I think it exists for Slurm clean-up reasons (including job post-fix scripts) and as such all Cylc task job processes (i.e. the stuff that the workflow cares about) should be finished by then. If so, telling the task to resubmit while the first job is still in COMPLETING should not cause a problem - the new job will have a new job ID and new Cylc task submit number, and hence a new Cylc job log directory, so there should be no interference.

But if your system is consistently leaving jobs as COMPLETING for a while, perhaps you could experiment to confirm or deny that speculation, and let us know the result?

[UPDATE: it could potentially be a problem - see below].

If a job is in the COMPLETING state then it means the job still has processes running because they haven’t (yet) responded to SIGKILL. This is almost certainly because they are stuck in D state (uninterruptable sleep) - this shouldn’t happen for extended periods unless your filesystem is misbehaving/overloaded. So, I can see how this could cause a problem for a re-run but I don’t think there’s anything we can do to prevent it.

Thanks @dpmatthews - I had to look that up:

In Linux, uninterruptible sleep (state D) is a process state where a process is waiting for a critical event, typically I/O completion, and will not respond to any signals, including the SIGKILL signal. This is done to ensure data integrity during hardware interactions

So agreed, filesystem issues might result in this state for a while, after an attempted job kill, if the job is doing IO at the time.

Even so, an immediate Cylc re-trigger shouldn’t normally cause additional problems, should it? The new job (even though it’s for the same task) will have a different job ID, working directory, log directory etc. (Hopefully the task job is not writing to files outside of its designated workflow directories).

@hilary.j.oliver The task could be trying to open files in the share directory which are still open by the hung process.

Ah, true - it could!

OK, I stand corrected: jobs stuck in COMPLETING for a while could potentially cause problems.

Thanks @hilary.j.oliver and @dpmatthews for this exchange, and sorry for the late answer (for some reason my email decided to archive this thread).

So: I never thought that the reading part could be a problem (input files can normally be read by multiple processes at a time, unless they are opened in read-write mode, which is not our case).

Unfortunately, we find ourselves in the other situation mentioned: our program writes logs to a directory outside the work directory. This is a design choice (and a very poor one, let’s be honest), but unfortunately we can’t change it right now. This means that a re-triggered run will attempt to open log files that are still owned by the COMPLETING one, and thus crash.

Looking into the SLURM documentation, though, I got an idea (I have not tested it yet): adding a directive such as

--dependency=singleton

means that the retriggered job will wait for the COMPLETING job to reach COMPLETED before starting, since:

This job can begin execution after any previously launched jobs sharing the same job name and user have terminated. In other words, only one job by that name and owned by that user can be running or suspended at any point in time.

And since Cylc job names do not include the ID of the attempt, this should work as a temporary fix …
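For concreteness, this is how such a directive could be attached to a task in flow.cylc (task and platform names here are placeholders):

```
[runtime]
    [[my_task]]
        platform = my_slurm_platform
        [[[directives]]]
            --dependency = singleton
```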

What do you think? Are you discussing some other fix?

Best
Stella

Well, I was thinking a bit more, and I have a question.

As I mentioned in the beginning, we are forced to use job polling comms. In my head it works as follows:

  1. After you sbatch a job: Cylc reads the job.status file at every polling interval, searching for the CYLC_JOB_EXIT message. Once it finds it, the status of the task is set accordingly (succeeded, failed, timeout, etc.).
  2. If you kill it: the return value of the scancel command (or the equivalent for other queue managers) determines the status of the task. So if scancel succeeds, the task is set to failed directly.

In the second case, instead of setting the task to failed, wouldn’t it be more correct to set it to a status like “cancelling”, and only set it to failed once the CYLC_JOB_EXIT message is written to the job.status file (i.e. when the job is really cancelled)?

If I am not wrong, scancel will not prevent the job from updating job.status, but of course I have no idea how much work this would imply on the Cylc side …

Best,
Stella

I don’t think it’s unreasonable for Cylc to regard the job as failed immediately after a successful scancel. In normal circumstances the job should be killed by Slurm almost immediately. Changing this behaviour in Cylc would not be trivial (and probably not desirable).

From what I can see, using --dependency=singleton should work. If it does, you can configure your platform to add this directive by default rather than having to add it to your workflow.

Note that this will not work in all cases. If a job can’t be killed for an extended period, Slurm will timeout and remove the job from the queue (and drain the node). However, this shouldn’t happen unless your filesystem is really sick.

Yeah, I guess if the filesystem is really sick then maybe retrigger is not what I want anyway.

I will go for the singleton solution, thanks again for the discussion!
Stella