Suite got killed quietly after running for about 10 days

Hi,

I have a suite with tasks that process real-time data using Python scripts on a daily basis.

It starts a daily cycle at a given time every day and first runs check_file tasks to check whether today’s data are ready, configured with ‘execution retry delays = 14400*PT10M’.

If the data (created by another suite) are not ready, it will run again 10m later.

Normally the task retries more than 200 times (~1.4 days) until it succeeds (because the data have become ready), and then the actual tasks start processing today’s data.
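
For reference, the check_file task is configured roughly like this (the script name is simplified here; it exits non-zero when today’s data are not yet present, which triggers the next retry 10 minutes later):

[runtime]
    [[check_file]]
        script = python check_file.py   # hypothetical name - exits non-zero if the data are not ready yet
        [[[job]]]
            batch system = pbs
            execution retry delays = 14400*PT10M   # retry every 10 minutes, up to 14400 times
            execution time limit = PT10M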

But after around 10 days, the suite would get killed quietly (or stopped automatically).
(no final cycle point is specified in the suite, so it wasn’t running out of cycles)

By the time it got killed, only the check_file tasks were running. Other tasks were in waiting status.
Most of the time I couldn’t find any error messages related to the suite being killed.
Sometimes I could find this message:
NewConnectionError(’<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0xc9eb90>: Failed to establish a new connection: [Errno 111] Connection refused’,))

Do you have any idea about the reason? Is there any way to debug this problem?

Thanks

Hailin

Are you by chance running on a computer system that imposes CPU time limits for processes? We’ve seen that with some of our HPC systems where the login nodes will kill processes after they exceed a certain CPU time limit (which can be quite long with Cylc).
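
If the limit is applied through ordinary shell/OS resource limits, you may be able to see it yourself on the suite host, e.g.:

ulimit -t   # per-process CPU time limit for this shell, in seconds ("unlimited" if none)
ulimit -a   # all resource limits

though site-specific mechanisms (e.g. a daemon that kills long-running processes on login nodes) won’t necessarily show up there, so the system administrators are the authoritative source.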

Hi Yan,

Tim’s suggestion above seems important.

Aside from that, I’ve thought about the problem you outline & undertaken some investigation into what it could be based on the information provided.

However, given the context & that particular error message, which could emerge from a number of sources, I think we (the Cylc development & support team) would need more information to determine what has happened & why.

One thing that does strike me is that you are setting up a very large number of retries overall for that task: over around 10 days that is roughly (10*24*60)/10 ≈ 1440 jobs having run (with about 5 job files created for each), so it is plausible you are hitting some sort of limit, e.g. the file descriptor limit, a limit on file size with the correspondingly large suite log file, or even running low on disk space so that Cylc cannot write the files it needs to handle the running of your suite.

On that note, could you please let us know the following:

  1. The Cylc version you are using;
  2. The output of a ‘quotas’ command to confirm you have sufficient disk space;
  3. The batch system you were using to submit the ‘check_file_task’, & any ‘suite.rc’ [runtime] settings (environment, directives or scripts) that might otherwise be relevant for the ‘check_file_task’;
  4. The final (maybe 10 or so) lines from the ‘log/suite/log’, & the general format & message of any lines marked as ‘CRITICAL’ or ‘WARNING’ (no need for task name details);
  5. If you can determine this, roughly (to order of magnitude) how many tasks were in waiting status while the ‘check_file_task’ was still running (is this very large?).
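
For items 2 & 4, something along these lines should be enough (substitute your actual suite name for SUITE-NAME; ‘quota -s’ is just one common way of getting the quota report, your site may provide a different command):

quota -s
tail -n 10 ~/cylc-run/SUITE-NAME/log/suite/log
grep -E 'CRITICAL|WARNING' ~/cylc-run/SUITE-NAME/log/suite/log | tail -n 20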

Sorry I can’t be more help at this stage.

Is that error in the job log, or the suite server log? If the former, it could just indicate that an orphaned job was unable to connect to the server because the server just got killed.

In general terms:

  • If the server program died due to some internal error, you should see a Python traceback in the server log.

  • If the server program shut itself down on detecting some error condition that it can’t handle (e.g. disk quota exceeded) you should see a sensible error message in the server log.

  • If the server program was killed, e.g. by your OS for exceeding some CPU time limit, or manually by a system administrator, then depending on the kill method (e.g. kill -9 PID) it may not be possible for Cylc to log any kind of debug info before it dies.

Hilary

Hi Tim,

How do I know whether there are CPU time limits, and what the limits are?

Thanks
Hailin

Thanks Hilary.

Where is the suite server log stored?
How do I know whether there are CPU time limits? I asked Tim the same question.

Hailin

Hi Hailin,

Your suite job logs are in ~/cylc-run/SUITE-NAME/log/job.

Your suite server log is in ~/cylc-run/SUITE-NAME/log/suite.

You can view both with cylc cat-log, if you want to avoid remembering the file paths.
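
For example (‘my.suite’ is a hypothetical suite name here):

cylc cat-log my.suite                          # the suite server log
tail -n 50 ~/cylc-run/my.suite/log/suite/log   # the same file, read directly
ls ~/cylc-run/my.suite/log/job/                # job logs, organised by cycle point, task name and submit number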

How do I know whether there are CPU time limits?

I don’t know about that, sorry (well, it’s not really a Cylc problem, it could potentially affect any running program). I guess you’d have to ask your system admins if there are such limits on your suite hosts.

Hilary

Thanks for your reply.

Please see below the info you asked for.

  1. cylc version: 7.7.2

  2. Disk quotas for group pr_s1proc (gid 1831):
    Filesystem   used    quota  limit  grace  files    quota  limit  grace
    /g/sc/fs5    71.58T  90T    90T    -      4666361  0      0      -

  3. batch system = pbs
    execution retry delays = 14400*PT10M
    execution time limit = PT10M

  4. log/suite/log:
    2019-07-14T10:54:51Z INFO - [check_time_avg_bias_atmos_tasmax_20_39.20190712T0600Z] -(current:running)> succeeded at 2019-07-14T10:54:50Z
    2019-07-14T10:54:57Z INFO - [rgn_time_avg_biascorrected_tasmax_40_59.20190712T0600Z] -(current:ready) submitted at 2019-07-14T10:54:51Z
    2019-07-14T10:54:57Z INFO - [rgn_time_avg_biascorrected_tasmax_40_59.20190712T0600Z] -job[01] submitted to localhost:pbs[7220079.terra-pbs]
    2019-07-14T10:54:57Z INFO - [rgn_time_avg_biascorrected_tasmax_40_59.20190712T0600Z] -health check settings: submission timeout=None
    2019-07-14T10:54:57Z INFO - [stn_time_avg_calibrated_pr_emn.20190712T0600Z] -submit-num=1, owner@host=localhost
    2019-07-14T10:55:03Z INFO - [stn_time_avg_calibrated_pr_emn.20190712T0600Z] -(current:ready) submitted at 2019-07-14T10:55:02Z
    2019-07-14T10:55:03Z INFO - [stn_time_avg_calibrated_pr_emn.20190712T0600Z] -job[01] submitted to localhost:pbs[7220091.terra-pbs]
    2019-07-14T10:55:03Z INFO - [stn_time_avg_calibrated_pr_emn.20190712T0600Z] -health check settings: submission timeout=None
    2019-07-14T10:55:08Z INFO - [rgn_time_avg_anom_ocean_sst_60_79.20190712T0600Z] -(current:running)> succeeded at 2019-07-14T10:55:05Z
    2019-07-14T10:55:13Z INFO - [rgn_time_avg_calibrated_pr_emn.20190712T0600Z] -submit-num=1, owner@host=localhost
    2019-07-14T10:55:13Z INFO - [rgn_time_avg_calibrated_pr_1_19.20190712T0600Z] -(current:submitted)> started at 2019-07-14T10:55:08Z
    2019-07-14T10:55:13Z INFO - [rgn_time_avg_calibrated_pr_1_19.20190712T0600Z] -health check settings: execution timeout=PT3H10M, polling intervals=PT3H1M,PT2M,PT7M,…
    2019-07-14T10:55:18Z INFO - [rgn_time_avg_calibrated_pr_emn.20190712T0600Z] -(current:ready) submitted at 2019-07-14T10:55:18Z
    2019-07-14T10:55:18Z INFO - [rgn_time_avg_calibrated_pr_emn.20190712T0600Z] -job[01] submitted to localhost:pbs[7220092.terra-pbs]
    2019-07-14T10:55:18Z INFO - [rgn_time_avg_calibrated_pr_emn.20190712T0600Z] -health check settings: submission timeout=None
    2019-07-14T10:55:23Z INFO - [rgn_time_avg_calibrated_pr_60_79.20190712T0600Z] -(current:running)> succeeded at 2019-07-14T10:55:22Z

  5. 800 jobs are in waiting status.

By the way, I am running other suites which also do real-time processing in a similar way to this suite.
They run with the same Cylc version, batch system, HPC, etc.,
and they have been running well for a few months without any problem.
But they have fewer waiting jobs, around 250.
The total number of check_file retries in those suites is around 276 (1.92 days),
while in this suite it is around 303 (2.1 days).

Thanks
Hailin

Hi Hailin,

A couple of quick thoughts:

  • cylc-7.7.2 is a year old now, you should really upgrade.
  • your suite log shows no sign of any error before shutdown, which is consistent with (although not definitive proof of) the server program being killed (e.g. kill -9) with no chance to log a shutdown message. If that’s what happened, you really have to figure out what killed it. Maybe the system (/OS) logs show processes killed for using too much memory, or too much CPU time, or something…?
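
On that last point, if the suite host is a machine you can inspect, something like the following might show whether the kernel’s out-of-memory killer was involved (a kill by a site-specific admin daemon generally won’t appear here, though):

dmesg -T | grep -iE 'killed process|out of memory'
journalctl -k | grep -i 'killed process'   # on systems with systemd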

Hilary

Thanks Hilary.

I used the ‘ps -ef’ command to check the running suites and got the info below.
hyan 24273 1 5 Jul01 ? 16:11:29 cylc-7.7.2/bin/cylc-run averaging_v4_weekl
hyan 146420 1 0 Jun11 ? 02:10:27 cylc-7.7.2/bin/cylc-run calibration_v3_wee

averaging_v4_weekl has accumulated about 16 hours of CPU time and has been running for 15 days. In my experience it will get killed soon, once the CPU time hits a limit.
(I haven’t yet confirmed the limit with my system administrator)
calibration_v3_wee, on the other hand, has been running for 25 days, but its accumulated CPU time is only around 2 hours.

The suite calibration_v3_wee has been using more processors (256+) with MPI running on compute nodes, but the suite averaging_v4_weekl uses MAMU nodes to run more than 200 tasks that only require 1 CPU to run with bash or Python.

Is the CPU time used by the MPI runs not captured by the ps command?
Do you have any suggestions for avoiding the CPU time accumulation?

Hailin

Presumably you mean your suite is managing MPI jobs. That should not be relevant here at all, because the jobs that a suite runs are separate processes (and they usually run on other machines too), so ‘ps’ will definitely not see them as part of the cylc-run server process. It is only the resource that the suite server program uses, on the suite host, that matters (in terms of whether or not the server program gets killed for exceeding some limit).

How much CPU time a Cylc server program will use depends mostly (I would guess) on the size of the suite (how many tasks it has to keep track of and compute the scheduling for) and how active the suite is. So for a given suite, the only way to avoid a hard CPU time limit would be to periodically stop and restart the suite, so that it never exceeds the limit in one run. You could probably write a small script to monitor CPU time use (via ps) and issue shutdown and restart commands periodically.
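
A rough, untested sketch of such a script (the suite name, CPU threshold and pgrep pattern are placeholders you would need to adapt):

#!/bin/bash
# Restart a suite before its server process hits a CPU time limit.
SUITE=my_suite                    # placeholder - your suite name
MAX_CPU_SECONDS=$((14 * 3600))    # placeholder - act once the server has used ~14 h of CPU

# Find the suite server process (its command line looks like "cylc-run SUITE" in ps -ef).
PID=$(pgrep -f "cylc-run ${SUITE}" | head -n 1)
[ -z "$PID" ] && exit 0

# Cumulative CPU time from ps, in [DD-]HH:MM:SS form, converted to seconds.
T=$(ps -o time= -p "$PID" | tr -d ' ')
D=0; [[ "$T" == *-* ]] && { D=${T%%-*}; T=${T#*-}; }
IFS=: read -r H M S <<< "$T"
SECS=$(( (10#$D * 24 + 10#$H) * 3600 + 10#$M * 60 + 10#$S ))

if (( SECS > MAX_CPU_SECONDS )); then
    cylc stop "$SUITE"                                  # clean shutdown (waits for active tasks)
    while pgrep -f "cylc-run ${SUITE}" > /dev/null; do  # wait for the server to exit
        sleep 60
    done
    cylc restart "$SUITE"                               # restart from the saved suite state
fi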

Hilary

Hi Hailin -

Information about any system limits and the values thereof would need to come from the documentation you have on the shared system, or (more reliably) from the system administrators.

As @hilary.j.oliver notes, the “CPU time” usage in this case is that of the Cylc server itself, which is not connected to the CPU time used by MPI tasks and any jobs launched by Cylc. We haven’t come up with a good solution to deal with it other than “check in every now and then and restart if necessary” simply because it’s so hard to actually predict/handle when it would happen, and easy enough (though annoying) to work around.

This is another argument to go the route of having the suite servers running on separate systems that are lightweight and don’t have these limits attached.

Tim

Yes, definitely best to run your suite server programs on a dedicated front-end host (with no CPU time limits) … or even better, a small pool of them. Then you get basic load balancing (suites can start up on the least loaded host) and the ability to tell suites to self-migrate to another host (e.g. for system maintenance).