Is the SSL certificate verification the first thing that happens in the task’s comms back to the server?
Cylc 7 uses HTTPS, so the verification must happen before the server will act on an incoming message.
Maybe if there’s a general network issue
A transient network issue would result in a temporary period of message failures. However, the errors you are seeing appear more nuanced (CERTIFICATE_VERIFY_FAILED) and persistent, which hints at a larger issue.
I’m basically wondering whether the problem is with the Cylc server
The Cylc server is a fairly basic HTTPS server, so I would expect it to either work or not. Semi-intermittent cert verification failure seems a bit too exotic to come from Cylc.
The only Cylc origin of this error I can think of is as follows, and it’s seriously niche:
- A scheduler submits jobs on a remote host.
- AND the scheduler crashes fatally, so it fails to clean up the relevant files on the remote host.
- AND another suite is started up and, by complete coincidence, starts on the same host:port as the first one (relatively unlikely unless the suite server port range is narrow or congested).
- AND the first suite is not restarted in the meantime.
- AND all of this happens BEFORE the submitted job has completed.
This would create a time window in which cylc message calls from jobs submitted by the first scheduler would arrive at the second scheduler, where they would be rejected due to cert verification failure.
This window would close once the original suite is restarted and the remote-init completes.
(A non-Cylc variant of the above would be some other HTTPS server, rather than a Cylc scheduler, starting up at the same host:port pair.)
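One way to test for this scenario from the job host is to compare the certificate presented by whatever is listening at the job's recorded host:port with the certificate that was synced to the job host. The following is a minimal sketch using only the Python 3 standard library. The function names are my own, and the path of the synced certificate on your job host is an assumption you would need to check for your installation.

```python
import hashlib
import ssl


def pem_fingerprint(pem_text):
    """Return the SHA-256 hex digest of a single PEM-encoded certificate."""
    der_bytes = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    return hashlib.sha256(der_bytes).hexdigest()


def server_matches_local_cert(host, port, local_cert_path):
    """Return True if the server at host:port presents the local certificate.

    local_cert_path is assumed to point at a file holding exactly one
    PEM certificate (hypothetical location, check your own job host).
    """
    # get_server_certificate fetches the certificate without validating it,
    # so this works even when normal verification would fail.
    served_pem = ssl.get_server_certificate((host, port))
    with open(local_cert_path) as cert_file:
        local_pem = cert_file.read()
    return pem_fingerprint(served_pem) == pem_fingerprint(local_pem)
```

If the fingerprints differ, that would be consistent with messages reaching a different server from the one that submitted the job, though it does not by itself prove which server is at fault.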
Cylc 7 Certificate Lifecycle
The suite passphrase and SSL certs are created when the scheduler starts up and (under normal conditions) are removed when the scheduler shuts down (the exception being when a scheduler crashes due to a fatal error).
Before a task is submitted to a “new” remote host, Cylc makes a “remote-init” call, which syncs the passphrase and SSL certs to the remote platform along with a “contact file”. These files enable cylc message to work out where the scheduler is running and to connect to it securely.
On restart, Cylc goes through the list of tasks which were running/submitted from the previous run (which can happen e.g. with cylc stop --now) and checks their job.status files to see if any messages were logged whilst the scheduler was down. Cylc also makes another “remote-init” call to re-sync the SSL certs etc. Once this is done, cylc message calls from any running jobs submitted before the restart will make it through to the scheduler.
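To see where a job thinks its scheduler is running, you can inspect the contact file that remote-init synced to the job host. Here is a minimal sketch, assuming the contact file is a plain list of KEY=VALUE lines. The CYLC_SUITE_HOST and CYLC_SUITE_PORT key names and the sample values are assumptions for illustration, so check them against your own file.

```python
def parse_contact_file(text):
    """Parse KEY=VALUE lines into a dict, skipping blank or malformed lines."""
    contact = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, value = line.split("=", 1)
        contact[key.strip()] = value.strip()
    return contact


# Hypothetical sample content, not taken from a real suite.
sample = """\
CYLC_SUITE_HOST=scheduler.example.com
CYLC_SUITE_PORT=43001
"""
contact = parse_contact_file(sample)
print(contact["CYLC_SUITE_HOST"], contact["CYLC_SUITE_PORT"])
```

Comparing the host and port recorded on the job host with where the scheduler is actually running now is a quick way to spot a stale contact file left behind by a crashed scheduler.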
Workaround?
Cylc 7 offers two other task communication methods: SSH and polling. If you are able to SSH back from the job host to the scheduler host, then SSH might be a good option.
Polling works fine, but it’s pull rather than push, so the scheduler only learns of task state changes at each polling interval.
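As far as I recall, the communication method is chosen per job host in the site or user global.rc file. This is a sketch of what that might look like, where my-job-host is a placeholder for your own job host name. Please check the global.rc reference for your Cylc 7 version before relying on it.

```
[hosts]
    [[my-job-host]]
        # Assumed options: default (direct HTTPS), ssh, or poll.
        task communication method = ssh
```

Switching to poll instead would avoid the need for any connection back from the job host, at the cost of slower task state updates.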
Cylc 8
Cylc 8 uses TCP+ZMQ rather than TCP+HTTP(S), and uses a ZMQ “curve” keypair rather than a passphrase and SSL cert. Otherwise it works in much the same way.