Background tasks occasionally stuck at "submitted"

I occasionally have background (i.e. not batch-queued) tasks whose prerequisites are all fulfilled: they go from waiting to submitted and then just stay there for a long time. Sometimes they eventually start running; other times they sit that way for 30 minutes or more, until my patience runs out and I intervene manually. There is no internal queue in use, and there’s no issue with runahead. In fact, this happens in situations like this:

foo => bar
foo => baz

…and then bar will run but baz stays stuck at submitted, even as tasks in the next cycle point that depend on this cycle point’s bar go ahead and run.

I have not been able to predict when this will happen and when it will not.

Any ideas on how to run this down? Thanks.

Are you sure you mean stuck as submitted and not as ready? Inside the Cylc scheduler, job submission commands for tasks in the ready state are queued to a process pool, which also handles event handler and xtrigger execution. The size of that pool (the max number of concurrent processes) is a global config item (default 4, I think). If the process pool gets tied up with (e.g.) hung event handlers, tasks can get stuck as ready for a while.
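If ready-state pile-ups do turn out to be the issue, you can enlarge the pool. A rough sketch for Cylc 7’s global.rc (I believe the item is called process pool size; check the docs for your version):

    # global.rc (Cylc 7) - sketch: enlarge the scheduler's subprocess pool
    process pool size = 8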

If your tasks really do get stuck, infrequently, as submitted, that normally implies a network problem prevented the “started” message getting sent back to the scheduler. Tasks can eventually recover from that state if the scheduler uses polling to determine the true state of the job (there’s a config sketch after the list below). If this is what’s happening you should see:

  • message send errors reported in the job.err file for the task
  • and the job should be seen as actually executing on the host, using ps or top etc.
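For reference, a minimal sketch of enabling polling in a Cylc 7 suite.rc would be something like this (the interval values are just illustrative):

    # suite.rc (Cylc 7) - sketch: poll submitted and running jobs
    [runtime]
        [[root]]
            [[[job]]]
                submission polling intervals = PT1M, PT5M, PT15M
                execution polling intervals = PT1M, PT5M, PT15M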

I can’t think of any other reason for background jobs to get stuck as submitted, because tasks only enter the submitted state when the job submission command exits with success status, and in the background case that should mean the job script is now executing.

Hilary

P.S. you can simulate this by adding init-script = sleep 10 to your tasks. The init-script runs first thing in the job script, before the job messaging command, so tasks will appear as “submitted” for 10 seconds after they have actually started executing.
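In suite.rc terms, using the thread’s example task foo, that would look roughly like:

    # suite.rc (Cylc 7) - make foo look "submitted" for ~10s after it starts
    [runtime]
        [[foo]]
            init-script = sleep 10   # runs before the job messaging command
            script = echo "real work here"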

Hi, and thanks for getting back to me quickly! I’m pretty sure I do mean submitted rather than ready: the tasks show up as submitted when I do cylc dump, are shown in the submitted color code under cylc monitor, etc. (and I have seen ready in those before, so I know the difference).

I’m absolutely willing to believe that it’s some kind of network issue, since I’ve seen internal-to-cluster network issues cause problems that show up in job.err at other (later) times in PBS-managed tasks. (Would network issues still be a sticking point for background tasks running on the same host as the Cylc server, though? That’s what I’m seeing.) I’ll investigate the next time this happens to see if the symptoms you mention are present.

That does seem unlikely, although perhaps not impossible.

To add a bit to my explanation above, the submitted state means that the job submission command returned success status, but the “started” message has not come in from the running job yet (assuming you are not using the polling method of tracking job status).

For background tasks there is no external queuing system, so the only ways a task could appear to be stuck as submitted are:

  • the job submission command returned success but actually failed (job never ran) - seems unlikely
  • the job started running but was immediately killed (kill -9 PID) - seems unlikely?
  • the job ran but was unable to send its started (and then succeeded or failed) message back - unlikely for local jobs

If it happens again, can you check whether the seemingly-submitted task job actually ran and completed, started running but did not complete (yet did not report as failed), or never started running at all? (Cylc records the PID of the job, by the way.)
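A couple of check commands, assuming the standard Cylc 7 job log layout (substitute your own suite, cycle point, and task names):

    # job wrapper status file (NN symlinks to the latest submit number)
    cat ~/cylc-run/<suite>/log/job/<cycle-point>/<task>/NN/job.status
    # is the recorded PID still alive?
    ps -fp <PID>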

OK, it’s happening to me again right now. I have a task stuck as submitted. The Cylc-created task script “job” in the job log directory shows up in the ps display. However, my Python script that actually performs the task, and is pointed to in the cylc__job__inst__script() function, does not show up in the output of ps. Further, the first thing that Python script does is write a message to a log file; no such message is in the file, so that script seemingly hasn’t been invoked (and therefore, presumably, cylc__job__inst__script() isn’t being called).

The log files job.out and job.err have zero length.

Damn, I’ve never seen that and I’m struggling to understand what could cause it.

The problem is intermittent, right, for any task, not a regular failure for a particular task?

Another thing you can try is starting the scheduler in debug mode: cylc run --debug <suite-name>. This turns on set -x early in task job scripts, and the output should be directed to job.xtrace in the job log directory. That should tell us exactly where the boilerplate job script code is (evidently) hanging before it tries to invoke your Python script. The cylc__job__main shell function is at the top of <cylc-dir>/lib/cylc/job.sh, by the way. It must be hanging quite early on, before the info header is echoed to job.out (at line 83 in cylc-7.8.3, which is what you’re running, going by another recent post?) and well before cylc__job__inst__script is invoked.

To avoid messing with your real suite, you could of course write a short dummy suite that just repeatedly runs the same background task until it hits the problem, something like the sketch below.
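A rough, untested sketch of such a dummy suite (integer cycling; the task name is made up):

    # suite.rc (Cylc 7) - sketch: re-run one background task indefinitely
    [scheduling]
        cycling mode = integer
        initial cycle point = 1
        max active cycle points = 1
        [[dependencies]]
            [[[P1]]]
                graph = "foo[-P1] => foo"
    [runtime]
        [[foo]]
            script = sleep 1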

Hilary

You could also try these commands, to see what the job process is doing:

  • strace -p PID
  • lsof -p PID

(where PID is the process ID of the job script, as seen in your ps output).

Is the server under high load? Are you submitting a lot of jobs at once, or do you have many (or expensive) event handlers active? By default I think Cylc will only process 4 submissions/event handlers (and maybe other stuff?) at once, so if you have a lot going on it could leave things lagging and appearing stuck.

Hi @TomC - yeah, an over-subscribed Cylc subprocess pool can delay job submissions. However, that should cause tasks to appear stuck as “ready” rather than “submitted”.

OK, a follow-up (finally) to this. In some, but not all, cases where this occurs, I see something like this in the job.out file of the job that’s stuck at “submitted”:

2021-06-15T22:32:31Z INFO - started
2021-06-15T22:32:32Z WARNING - Message send failed, try 1 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:32:38Z WARNING - Message send failed, try 2 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:32:43Z WARNING - Message send failed, try 3 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:32:48Z WARNING - Message send failed, try 4 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:32:53Z WARNING - Message send failed, try 5 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:32:59Z WARNING - Message send failed, try 6 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:33:04Z WARNING - Message send failed, try 7 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
2021-06-15T22:33:16Z INFO - succeeded
2021-06-15T22:33:17Z WARNING - Message send failed, try 1 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:33:22Z WARNING - Message send failed, try 2 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:33:27Z WARNING - Message send failed, try 3 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:33:32Z WARNING - Message send failed, try 4 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:33:38Z WARNING - Message send failed, try 5 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:33:43Z WARNING - Message send failed, try 6 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)
   retry in 5.0 seconds, timeout is 30.0
2021-06-15T22:33:48Z WARNING - Message send failed, try 7 of 7: Cannot connect: https://my.hostname.snipped:43095/put_messages: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)

I don’t know enough to tell whether this is a Cylc issue or a host issue.

Addendum: I have verified that the port the task is trying to connect to, 43095, is the port that the Cylc server process is currently listening on.
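For the record, standard Linux tools can confirm which process owns that port; e.g.:

    # which process is listening on the suite port?
    ss -tlnp | grep 43095
    # or equivalently:
    lsof -iTCP:43095 -sTCP:LISTEN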

OK, that at least makes sense. The job appears to be stuck as submitted because it was not able to send the started (and later, succeeded) message back to the server.

I’m not sure what to suggest next to debug this, sorry. I’ve never seen it before, and it is strange that SSL certificate verification is failing (for the https connection to the server) intermittently and infrequently.

If the problem really is infrequent, it might be easiest to work around it by configuring job polling for the affected job host (see the sketch below), so that the correct task status will at least be picked up automatically after some delay and the workflow can carry on. (Messaging failures do not actually cause job failure; they just stop the status message getting back.)
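A sketch of that workaround in Cylc 7’s global.rc (item names from memory, so check the docs; the host name is hypothetical and the intervals are illustrative):

    # global.rc (Cylc 7) - sketch: track job status by polling on one host
    [hosts]
        [[hpc-login.example.com]]   # hypothetical job host
            task communication method = poll
            submission polling intervals = PT30S, PT1M, PT5M
            execution polling intervals = PT30S, PT1M, PT5M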

Cylc 8 does not use https for communicating with the scheduler, by the way, so at least you won’t see this particular problem after upgrading.

Is the SSL certificate verification the first thing that happens in the task’s comms back to the server? I’m basically wondering whether the problem is with the Cylc server or with the network on the HPC. Maybe if there’s a general network issue, it would show up in the task’s log as a failure to do the first thing in the communication attempt? I’m fairly ignorant about this stuff so I dunno if that makes sense.

Is the SSL certificate verification the first thing that happens in the task’s comms back to the server?

Cylc 7 uses HTTPS; the verification must happen before the server will act on an incoming message.

Maybe if there’s a general network issue

A transient network issue would result in a temporary period of message failures, however, the errors you are seeing appear more nuanced (CERTIFICATE_VERIFY_FAILED) and persistent which hints at a larger issue.

I’m basically wondering whether the problem is with the Cylc server

The Cylc server is a fairly basic HTTPS server; I would expect it either to work or not. Semi-intermittent cert verification failure seems a bit too exotic to come from Cylc.

The only Cylc origin of this error I can think of is as follows (it’s seriously niche):

  • A scheduler submits jobs on a remote host.
  • AND the scheduler crashes fatally, so it fails to clean up the relevant files on the remote host.
  • AND another suite starts up and, by complete coincidence, lands on the same host:port as the first one (relatively unlikely unless the suite server port range is narrow or congested).
  • AND this happens before the submitted job has completed.
  • AND the first suite is not restarted within this window.

This would create a time window in which cylc message calls from jobs submitted by the first scheduler would arrive at the second scheduler where they would be rejected due to cert verification failure.

This window would close once the original suite is restarted and the remote-init completes.

(A non-Cylc variant of the above: some other HTTPS server, rather than a Cylc scheduler, starts up on the same host:port pair.)

Cylc 7 Certificate Lifecycle

The suite passphrase and SSL certs are created when the scheduler starts up and (under normal conditions) are removed when the scheduler shuts down (the exception being when a scheduler crashes due to a fatal error).

Before a task is submitted to a “new” remote host, Cylc makes a “remote-init” call; this syncs the passphrase and SSL certs to the remote platform, along with a “contact file”. These files enable cylc message to work out where the scheduler is running and to connect to it securely.
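On the job host these land in the suite’s service directory; as an illustration of the typical Cylc 7 layout (contents may vary by version):

    # files synced by remote-init
    ls ~/cylc-run/<suite>/.service/
    # contact  passphrase  ssl.cert  ...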

On restart, Cylc goes through the list of tasks that were running/submitted in the previous run (which can happen e.g. with cylc stop --now) and checks their job.status files to see if any messages were logged while the scheduler was down. Cylc also makes another “remote-init” call to re-sync the SSL certs etc. Once this is done, cylc message calls from any running jobs submitted before the restart will make it through to the scheduler.

Workaround?

Cylc 7 offers two other task communication methods, SSH and polling. If you can SSH back from the job host to the scheduler host, then SSH might be a good option (sketch below).
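A global.rc sketch for the SSH method (hypothetical host name):

    # global.rc (Cylc 7) - sketch: send task messages back over ssh
    [hosts]
        [[hpc-login.example.com]]   # hypothetical job host
            task communication method = ssh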

Polling works fine but it’s pull rather than push.

Cylc 8

Cylc 8 uses TCP + ZMQ rather than TCP + HTTP(S), and uses a ZMQ “curve” key pair rather than a passphrase and SSL cert. Otherwise it works in much the same way.
