Suite unresponsive and can't be killed

Hi

I have a suite which is unresponsive. No waiting tasks will run. I can’t hold tasks or submit tasks. I can’t stop the suite.

  • rose suite-stop does not work
  • cylc stop does not work
  • kill -9 PID does not kill the cylc-run process!

How can I stop this suite, or even just get it moving again?

I’m not sure exactly why this happened. We had an issue earlier today with the VM it is running on, probably because it was overloaded (commands were hanging, etc.). However, I haven’t heard exactly what the problem was, so I have no details on what might have been affected. Maybe the home disk was being hammered? Yesterday the VM was also very overloaded (though not unresponsive) and my suites similarly got stuck, but eventually started running again. Today one of my suites started up again, but the other is still stuck. (It isn’t usual, or at least not common, for the suites to get stuck like this.)

Okay, the suite stopped! That was about 8 hours after it froze, and maybe an hour after my attempts to kill it. I stopped watching it over an hour ago, anyway.

If you can’t kill a process with kill -9, that almost certainly means the process is stuck in uninterruptible sleep waiting for I/O; see https://unix.stackexchange.com/questions/5642/what-if-kill-9-does-not-work
If so, the ps or top commands will show the process in state D.
(e.g. try ps -eo state,start,pid,ppid,euser,pcpu,time,cmd --sort=state).
Not a lot you can do except sort out your filesystem issues.
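
To narrow the ps output down to just the stuck processes, something like the following might help. It is only a rough sketch: it assumes a Linux system with procps ps and awk available, and the function names filter_dstate and list_dstate are made up for illustration, not standard commands. The wchan column shows the kernel function a process is sleeping in, which can hint at which filesystem is blocking.

```shell
# Hypothetical helper: keep the header row plus any row whose
# first field (the process state) starts with D.
filter_dstate() {
    awk 'NR == 1 || $1 ~ /^D/'
}

# Hypothetical helper: list all processes in uninterruptible sleep.
# Assumes procps ps; wchan:32 widens the wait-channel column.
list_dstate() {
    ps -eo state,pid,ppid,euser,wchan:32,cmd | filter_dstate
}
```

Running list_dstate a few times, some seconds apart, should show whether the same PID stays in state D. A process that only appears there briefly is just doing ordinary I/O.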