Cylc GUI shows the wrong task status when a task ends abnormally (EXIT)

Hi,

We are using Cylc version 7.8.11.
Some tasks ended abnormally (EXIT) because they hung while running and exceeded the wall time limit.

Error log: Exited by LSF signal TERM_RUNLIMIT.

In this case, the task status is not updated in the GUI; the task is still displayed as ‘running’ (green).
Even when I trigger a ‘poll’ from the Cylc GUI, the LSF job status (EXIT) is not picked up and the task is still displayed as ‘running’ (green).

Also, 7.8.11 is the default Cylc version at our site, so it cannot be upgraded.

Does anyone have a fix for this problem?

Many thanks,
Sonia

Hi @sonia,

(I know you already know this from an earlier communication, but for others reading - Cylc 7.8.11 is very old - you should upgrade, or encourage your site admins to do that).

To make sure I understand:

  • some task jobs were killed by your load manager (LSF) for exceeding the wall clock limit
  • the failed (killed) job status does not get communicated back to the Cylc scheduler
  • even if you poll the job from Cylc, that does not update the status to failed (killed)

Is that correct?

To help us see what the problem might be, can you answer a few questions, please?

  • do normal job status changes update correctly (waiting → submitted → running → succeeded); and do they update without job polling?
  • what happens if a job fails for internal reasons or gets killed by you (not LSF)? Does the status update correctly to “failed” in that case?

I’m not familiar with LSF, but presumably the kill signal (for wall clock limit exceeded) is not trappable (like kill -9 aka kill -KILL). That means the Cylc job wrapper cannot trap the signal and send a job status message back before dying. However, job polling should still return the correct result after the kill has occurred.

  • do you know if job polling works in other circumstances, on your system?

The version cannot be changed because the rose climate experiment is set to the cylc/7.8.11 environment…

Your understanding is correct.

  • I was running an optimization test of the Intel MPI library version.
    → With impi/2021.2.0 loaded, the job hung (the default we normally use is impi/2019.9.304).
    → The task jobs were then killed by the LSF administrator for exceeding the wall clock limit (EXIT).
    → The failed job status was not communicated back to the Cylc GUI (status still ‘running’, green).
    → Polling from the Cylc GUI did not update the status to ‘failed’ (still ‘running’, green).

  • do normal job status changes update correctly and do they update without job polling?
    → Yes

  • what happens if a job fails for internal reasons or gets killed by you (not LSF)?
    → Yes, it updates correctly to ‘failed’ (red) in the Cylc GUI.
    → When I forcibly kill the task myself (not from the Cylc GUI), it is updated correctly to ‘failed’.

  • do you know if job polling works in other circumstances, on your system?
    → Yes, polling works, and it is set to update automatically every 15 seconds.

This state occurs only when the job is killed for exceeding the wall clock limit (EXIT) after hanging.

If any source code in Cylc needs to be modified, please let me know.

(I don’t know the practical implications of that setting for you, but in principle the workflow engine version has no impact on scientific results - so long as it can run your workflow graph).

Thanks for answering my questions, that is helpful. However, it is mysterious that job status, including job failure, evidently updates correctly for everything except LSF wall clock time exceeded. Cylc does not care WHY a job failed, only IF it failed (etc.).

A job can be killed in two ways:

  • (1) A trappable signal like TERM, which essentially asks the process to shut down gracefully.
    • The Cylc job wrapper traps this and sends back a “failed” status message before exiting.
    • In this case, the job status updates immediately if you are using HTTP(S) job messaging (or TCP in Cylc 8), or else once polled.
  • (2) A non-trappable signal like KILL, which kills the process immediately.
    • This cannot be trapped, so the Cylc job wrapper can’t report status back before dying.
    • The job will appear to be “stuck” as running until Cylc polls it to update the status.
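
To illustrate the difference between the two cases, here is a minimal Python sketch. It is not Cylc's actual job wrapper: the child script, the status file, and the `run_and_signal` helper are all made up for this example, and it assumes a POSIX system. The child traps TERM and records "failed" before exiting, whereas KILL gives it no chance to record anything.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a job wrapper (not Cylc code): it traps TERM,
# writes "failed" to a status file, and exits. KILL cannot be trapped.
CHILD = r"""
import signal, sys, time

status_path = sys.argv[1]

def on_term(signum, frame):
    with open(status_path, "w") as handle:
        handle.write("failed\n")
    sys.exit(1)

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)   # tell the parent the handler is installed
time.sleep(60)
"""


def run_and_signal(sig):
    """Start the child, send it `sig`, and return what it managed to report."""
    with tempfile.TemporaryDirectory() as tmp:
        status_path = os.path.join(tmp, "status")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, status_path],
            stdout=subprocess.PIPE,
            text=True,
        )
        proc.stdout.readline()       # wait until the child prints "ready"
        proc.send_signal(sig)
        proc.wait(timeout=30)
        proc.stdout.close()
        if os.path.exists(status_path):
            with open(status_path) as handle:
                return handle.read().strip()
        return None                  # the child died without reporting


if __name__ == "__main__":
    print("TERM:", run_and_signal(signal.SIGTERM))
    print("KILL:", run_and_signal(signal.SIGKILL))
```

In the KILL case nothing is reported, which is why the status can only be corrected by a later poll.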

LSF is probably killing your job as in case (2), so you should not expect the job status to update correctly until Cylc polls the job (automatically, or at your request).

However, you seem to be saying that job polling does not detect the failure in this case, even though it works correctly in other cases?

That’s strange because the polling result does not depend on the reason for the job kill, only that (a) the job is no longer running - according to LSF; but (b) the job.status file did not report that it succeeded - which implies it must have failed.
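
As a rough sketch of that polling rule (not the actual Cylc implementation), the decision could be written as below. The function name, the `CYLC_JOB_EXIT` key, and the `SUCCEEDED` value are assumptions made for illustration; check a real job.status file on your system for the exact contents.

```python
# Assumed key and value for illustration only; compare with a real job.status.
EXIT_KEY = "CYLC_JOB_EXIT"
SUCCESS_VALUE = "SUCCEEDED"


def infer_job_state(batch_system_lists_job, status_file_text):
    """Return the state a poll would infer under the rule described above.

    batch_system_lists_job: True if the batch system (here LSF) still
        reports the job as present/running.
    status_file_text: the contents of the job status file, as KEY=VALUE lines.
    """
    entries = {}
    for line in status_file_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not KEY=VALUE
            entries[key.strip()] = value.strip()

    if entries.get(EXIT_KEY) == SUCCESS_VALUE:
        return "succeeded"
    if EXIT_KEY in entries:
        # The job recorded some other exit value before it stopped.
        return "failed"
    if batch_system_lists_job:
        # No exit recorded and the batch system still lists the job.
        return "running"
    # (a) no longer running, and (b) no success recorded: must have failed.
    return "failed"


if __name__ == "__main__":
    print(infer_job_state(False, "CYLC_JOB_INIT_TIME=2024-01-01T00:00:00Z"))
    print(infer_job_state(True, ""))
```

Under this rule, the only way a poll could keep returning "running" is if the batch system still lists the job, which is why it is worth checking what LSF itself reports.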

Next time this happens (can you reproduce it on demand?) can you:

  • look for any poll-related messages in the job-activity.log for the task
  • query LSF manually to make sure the killed job is not still listed as running (unlikely, but I can’t see any likely option at this point!)
  • also look for any related messages (after initiating polling) in the scheduler log
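
If the logs are long, a small helper like the following can pull out the poll-related lines. This is only a convenience sketch: the `poll_lines` name is made up, the sample log text is invented for the example, and the real message wording in your logs may differ.

```python
import re


def poll_lines(log_text, keyword="poll"):
    """Return the log lines that mention `keyword`, ignoring case."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]


if __name__ == "__main__":
    # Invented sample text; replace with the contents of your own log file,
    # e.g. open("path/to/job-activity.log").read()
    sample = "\n".join([
        "[jobs-submit ret_code] 0",
        "[jobs-poll ret_code] 0",
        "[jobs-poll out] (example output line)",
    ])
    for line in poll_lines(sample):
        print(line)
```

The same function can be pointed at the scheduler log to find messages logged after you initiate polling.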