HTTPS messaging fails: communication error between HPC and cylc hosts

Dear all,

I’m installing cylc/rose on a new system which has CentOS 8 on both the cylc host and the HPC host. I cannot get the two to talk to each other via the default “https” method.

This is my system-wide cylc config /opt/cylc/cylc-flow-7.9.1/etc/global.rc on the cylc host:

[communication]
method = https
[hosts]
[[localhost]]
copyable environment variables = FCM_VERSION, ROSE_VERSION
[[cl*-master*]] # this should be name of compute cluster
copyable environment variables = FCM_VERSION, ROSE_VERSION
retrieve job logs = True
retrieve job logs max size = 32M
retrieve job logs retry delays = PT10S, PT30S, PT3M
task communication method = ssh

My system-wide rose config in /opt/rose/rose-2019.01.3/etc/rose.conf:

[rose-host-select]
default=cl3-master

This warning during suite startup may have something to do with it?

[WARN] CherryPy Checker:
[WARN] The use of ‘localhost’ as a socket host can cause problems on newer systems, since ‘localhost’ can map to either an IPv4 or an IPv6 address. You should use ‘127.0.0.1’ or ‘[::1]’ instead.

But where do I replace localhost with the loopback address?

The suite starts up, but the gcontrol doesn’t get any feedback from the HPC host - the messages don’t make their way back to the cylc host.

The first thing I checked is password ssh is enabled in both directions.

Then I checked firewalld on the cylc host and enabled the port range 43000-43100:

sudo firewall-cmd --permanent --zone=public --add-port=43000-43100/tcp

I can now connect from the HPC host back to the cylc host on any port in that range.

But when the suite starts (rose suite-run), the gcontrol pops up and says not connected. I forced an update using Control > Poll All ... and after a timeout I see this error message:

Cannot connect: https://localhost:43015/poll_tasks: (‘Connection aborted.’, error(104, ‘Connection reset by peer’))

I then checked the local logs on the cylc host to see if any task has been launched in the background. And I saw a directory already created for the usual cold start tasks, and I checked the local logs at ~/cylc-run/SUITE/log/job/20200922T0000Z/TASK/NN/job.err and found this:

2020-09-23T12:16:50Z WARNING - Message send failed, try 1 of 7: Cannot connect: https://localhost:43015/put_messages: (‘Connection aborted.’, error(104, ‘Connection reset by peer’))

It attempts that connection 7 times and then exits with an error.
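
The retry pattern in that log line can be sketched as follows. This is only an illustration of "try N times with a fixed pause", not Cylc's actual messaging code; the function name `send_with_retries` and the `send` callable are hypothetical, and the defaults of 7 tries and 5.0 seconds are taken from the CYLC_TASK_MSG_MAX_TRIES and CYLC_TASK_MSG_RETRY_INTVL values in the contact file quoted later in this thread.

```python
import time


def send_with_retries(send, max_tries=7, interval=5.0, sleep=time.sleep):
    """Call send() until it succeeds or max_tries attempts have failed.

    Returns (ok, errors): ok is True if an attempt succeeded, and errors
    holds one log-style line per failed attempt.
    """
    errors = []
    for attempt in range(1, max_tries + 1):
        try:
            send()
            return True, errors
        except Exception as exc:
            errors.append(
                "Message send failed, try %d of %d: %s" % (attempt, max_tries, exc)
            )
            # Pause between attempts, but not after the final one.
            if attempt < max_tries:
                sleep(interval)
    return False, errors
```

With the default values, a job whose every attempt fails would give up after roughly 30 seconds of pauses plus the time spent in the failed connection attempts themselves.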

Then I checked the HPC and the cold start tasks were actually running, so the job submission had been successful, just the HPC couldn’t talk back to the cylc host to keep the gcontrol updated.

A similar error message to the above was displayed when trying to kill the suite with rose shut-down:

Cannot connect: https://localhost:43015/set_stop_cleanly?kill_active_tasks=False: (‘Connection aborted.’, error(104, ‘Connection reset by peer’))

The shut-down doesn’t work - the tasks keep running on the HPC host, and I am left with a half-dead suite on the cylc host, which I have to clean up manually (delete “~/cylc-run/SUITE/.service/contact” and kill the pid of the “python2 /opt/cylc/cylc-flow-7.9.1/bin/cylc-run” process).

So how can I debug why the task messaging is failing?
Thanks!

I’ve tried a few more things to see what’s going on, e.g. using the openssl client to test an https connection.

From the HPC host to the old cylc host:

$ openssl s_client -connect cylchost:43077
CONNECTED(00000003)
23456247909248:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:ssl/record/ssl3_record.c:332:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 5 bytes and written 299 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---

But when I try to talk back from the HPC host to the new cylc host I get this:

$ openssl s_client -connect cylchost2:43023
23456247909248:error:0200206F:system library:connect:Connection refused:crypto/bio/b_sock2.c:110:
23456247909248:error:2008A067:BIO routines:BIO_connect:connect error:crypto/bio/b_sock2.c:111:
connect:errno=111

Is there something wrong with the SSL setup of the new cylc host? It’s a fresh CentOS8 installation, and I haven’t fiddled with it too much other than installing all the prerequisites for cylc/rose/fcm.

Thanks for any ideas!

Hello,

OK, lots going on here. Let’s start by looking at local comms and get that sorted before moving on to remote issues.

The following error from the GUI is curious:

To locate a running suite Cylc uses the contact file (located in ~/cylc-run/<suite>/.service/contact).

This message suggests that CYLC_SUITE_HOST is set to localhost in that file which, if true, is likely the issue. I would expect this to be set to the FQDN for that host (i.e. hostname -f).

Could you check the contact file of a running suite to see what CYLC_SUITE_HOST is set to? It could be to do with host self-identification. Also check the output of hostname and hostname -f.

Typically these two should match:

$ hostname
$ python2 -c 'import socket; print socket.gethostname()'

As should these two:

$ hostname -f
$ python2 -c 'import socket; print socket.getfqdn()'
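
The four checks above can also be gathered in one small script. This is a minimal sketch that only wraps the same `hostname` commands and `socket` calls; the helper names `run`, `hostname_report` and `mismatches` are made up for illustration.

```python
import socket
import subprocess


def run(cmd):
    """Return the stripped output of a command, or None if it cannot be run."""
    try:
        return subprocess.check_output(cmd).decode().strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def hostname_report():
    """Collect the four names that should agree pairwise."""
    return {
        "hostname": run(["hostname"]),
        "socket.gethostname()": socket.gethostname(),
        "hostname -f": run(["hostname", "-f"]),
        "socket.getfqdn()": socket.getfqdn(),
    }


def mismatches(report):
    """Return the (shell, python) pairs whose two values differ."""
    pairs = [
        ("hostname", "socket.gethostname()"),
        ("hostname -f", "socket.getfqdn()"),
    ]
    return [(a, b) for a, b in pairs if report[a] != report[b]]
```

Calling `mismatches(hostname_report())` on the suite host should return an empty list when the host identifies itself consistently.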

Oliver

Thanks for the reply. That’s really useful.
The first two hostnames match (hostname and socket.gethostname()), but the second one doesn’t:

$ hostname -f
localhost

$ python2 -c 'import socket; print socket.getfqdn()'
localhost.localdomain

This is a fairly fresh install, so it should be easy to track down why these mismatch.

I’ll go digging. Thanks!

Right, so I have done some digging and the system now returns the same value for both pairs of commands.
Turns out I had made a mistake in my /etc/hosts shortly after installation when I included the host name in the loopback line (at the end of the line starting with 127.0.0.1). That was wrong.

Rather than inventing a domain that wasn’t actually resolved locally I have now set up /etc/hosts in a way that returns the simple hostname for all of the commands.

Now when I start the suite with rose suite-run -- --debug the feedback in the shell window looks fine, and the gcontrol pops up, but then the suite doesn’t connect. At the bottom of the gcontrol I can see “stopped with ‘submitted’” and (not connected).

The menu option “Poll All …” now returns this error message:

Cannot connect: https://cylchost2:43002/poll_tasks: ('Connection aborted.', error(104, 'Connection reset by peer'))

It mentions the correct hostname, but still comes up with error code 104.

The contents of the contact file (~/cylc-run/model_plotter_dev/.service/contact) look fine:

CYLC_API=2
CYLC_COMMS_PROTOCOL=https
CYLC_DIR_ON_SUITE_HOST=/opt/cylc/cylc-flow-7.9.1
CYLC_SSH_USE_LOGIN_SHELL=True
CYLC_SUITE_HOST=cylchost2
CYLC_SUITE_NAME=model_plotter_dev
CYLC_SUITE_OWNER=model
CYLC_SUITE_PORT=43002
CYLC_SUITE_PROCESS=23354 python2 /opt/cylc/cylc-flow-7.9.1/bin/cylc-run model_plotter_dev --debug
CYLC_SUITE_RUN_DIR_ON_SUITE_HOST=/home/model/cylc-run/model_plotter_dev
CYLC_SUITE_UUID=d811c89c-6d85-4319-b097-630b65eece17
CYLC_TASK_MSG_MAX_TRIES=7
CYLC_TASK_MSG_RETRY_INTVL=5.0
CYLC_TASK_MSG_TIMEOUT=30.0
CYLC_VERSION=7.9.1
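
For checking this repeatedly, the contact file can be read with a few lines of Python. This sketch assumes the file is plain KEY=VALUE lines exactly as in the listing above; `parse_contact` and `host_looks_local` are hypothetical helper names, and the list of "local" names is just a guess at values that a remote job could not use to reach the suite host.

```python
def parse_contact(text):
    """Parse KEY=VALUE lines into a dict, splitting on the first '=' only."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        info[key] = value
    return info


def host_looks_local(info):
    """True if CYLC_SUITE_HOST is missing or a loopback-style name."""
    host = info.get("CYLC_SUITE_HOST", "")
    return host in ("", "localhost", "localhost.localdomain", "127.0.0.1", "::1")
```

Reading the file with `open(os.path.expanduser("~/cylc-run/model_plotter_dev/.service/contact")).read()` and passing the text to `parse_contact` gives a dict from which CYLC_SUITE_HOST and CYLC_SUITE_PORT can be picked out for the connection tests.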

Then I tried to communicate from the HPC host back to the cylc host. This now does something (it says CONNECTED), but still results in error 104 (which is also mentioned in the error message above):

$ openssl s_client -connect cylchost2:43002
CONNECTED(00000003)
write:errno=104
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 299 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---

It looks like SSL handshake is failing? Now I’m out of ideas.
Thanks for any hints!
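
The same test can be made from Python, which separates the TCP connection from the TLS handshake so it is clear which stage fails. This is a sketch for Python 3 using only the standard `socket` and `ssl` modules; the function name `probe_tls` and its return values are invented for illustration, and certificate verification is switched off on the assumption that the suite uses a self-signed certificate, so it is suitable for diagnosis only.

```python
import socket
import ssl


def probe_tls(host, port, timeout=10.0):
    """Return (stage, detail); stage is 'tcp_failed', 'tls_failed' or 'ok'."""
    # Stage 1: plain TCP connection (fails on a closed or firewalled port).
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return "tcp_failed", str(exc)

    # Stage 2: TLS handshake on top of the established connection.
    context = ssl.create_default_context()
    context.check_hostname = False   # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE
    try:
        tls = context.wrap_socket(raw, server_hostname=host)
    except OSError as exc:           # ssl.SSLError is a subclass of OSError
        raw.close()
        return "tls_failed", str(exc)
    try:
        return "ok", "%s, cipher %s" % (tls.version(), tls.cipher()[0])
    finally:
        tls.close()
```

For the case above, `probe_tls("cylchost2", 43002)` would be expected to report `tls_failed`, since the TCP connection succeeds (CONNECTED) and the reset arrives during the handshake.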

I found that if I switch the communication method in global.rc to

[communication]
method = http

then I have no problem. The suite communicates properly back and forth and the gcontrol updates properly. I can even query the cylc host and port number using curl to get an HTTP 401 error page:

$ curl http://cylchost2:43027/

This returns a page with this error:
HTTPError: (401, ‘You are not authorized to access that resource’)

But when I switch to https with:

[communication]
method = https

there is definitely a problem.
The startup text reports that the suite is now listening on https:

[INFO] *** listening on https://cylchost2:43069/ ***

so I tried the same curl request, but get an error:

$ curl https://cylchost2:43069/
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to cylchost2:43069
This is irrespective of whether I try the curl command on the HPC host or the cylc host itself.

I find it strange that nmap can’t see the port 43069 as open:

$ nmap cylchost2
Starting Nmap 7.70 ( https://nmap.org ) at 2020-09-24 13:23 +04
Nmap scan report for cylchost2 (192.168.37.106)
Host is up (0.00040s latency).
Not shown: 997 closed ports
PORT STATE SERVICE
22/tcp open ssh
111/tcp open rpcbind
10000/tcp open snet-sensor-mgmt

Nmap done: 1 IP address (1 host up) scanned in 0.07 seconds
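
A note on the scan above: by default nmap only probes the 1,000 most common ports (the three listed plus "997 closed ports"), so a port such as 43069 may simply not have been scanned; `nmap -p 43000-43100 cylchost2` would cover the Cylc range. The same check can be sketched in Python with plain TCP connects. The function name `open_ports` is hypothetical, and the port range is the one opened in firewalld earlier in this thread.

```python
import socket


def open_ports(host, first, last, timeout=0.5):
    """Return the ports in [first, last] that accept a TCP connection."""
    found = []
    for port in range(first, last + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            # connect_ex returns 0 on success and an errno value otherwise.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
        finally:
            sock.close()
    return found
```

Running `open_ports("cylchost2", 43000, 43100)` while a suite is up should list the port shown in the "listening on" startup line if the port is reachable at the TCP level.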

I also installed wireshark to get any clues, but I can’t see anything that would tell me what’s wrong, other than that the SSL connection gets rudely interrupted immediately with a RST/ACK packet.

The wireshark filter is ip.addr==192.168.37.106 and (tcp.port >= 43000 and tcp.port <= 43100) where 192.168.37.106 is the IP address of the cylc host.

Then I submit the command openssl s_client -connect cylchost2:43069 from the HPC host (communicating backwards).

The openssl command returns as before with:

CONNECTED(00000003)
write:errno=104

and the traffic registered in wireshark is:

2311 2.808934188 192.168.37.106 192.168.37.101 TCP 66 43069 → 38924 [RST, ACK] Seq=1 Ack=1 Win=235 Len=0 TSval=1878864944 TSecr=3953012286

where 192.168.37.106 is the cylc host and 192.168.37.101 is the HPC host.

The strange thing is that I thought I was submitting a request to contact cylchost2 on port 43069, but the destination port is actually 38924. But then again I don’t know much about how TCP actually works.

I can’t compare it to a system where https in cylc works. In the past I’ve always given up and used http before. But now I’m doing a big systems upgrade with new hardware and upgraded software, so I’d like to get this right and use https as intended.

If anyone has any ideas on how to get the SSL handshake working, I’d love to hear it.

Hmmm, RST, ACK might be the clue…

Hi again, is there any documentation on how the SSL connection between cylc host and HPC host is supposed to work? What are the steps to follow to establish an SSL connection? How can one check that these steps have been implemented?

I am lacking the basic knowledge of how an SSL handshake is supposed to work when everything works well, so I have no idea what to do when things don’t work as expected.

Both cylc host and HPC host are very recent installs of CentOS v8.2.

So I’d like to understand what makes it work in a setup where https communication works fine.

Thanks,
Fred

Hi Fredw,

In your experiment with port 43069, the packet you captured is the response from cylchost2:43069 to the sender (the HPC host) at 192.168.37.101:38924.

We can see the RST, which is the TCP reset that becomes the ‘Connection reset by peer’ error you see. And of course there is an ACK, so we know there is a machine at the other end :)

The 38924 will be OS assigned and is unlikely to be anything to do with the issue.
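
Both points can be reproduced on a single machine with a short Python sketch. It is only a demonstration and says nothing about why the Cylc server resets the connection: the client gets an OS-assigned source port, and the server aborts the connection with a TCP reset by closing with SO_LINGER set to a zero timeout. The function name `reset_demo` is made up, and the behaviour described is what Linux does.

```python
import errno
import socket
import struct


def reset_demo():
    """Return (server_port, client_port, error_number) for one aborted connection."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    server.listen(1)
    server_port = server.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    client.connect(("127.0.0.1", server_port))
    client_port = client.getsockname()[1]  # OS-assigned source port

    conn, _ = server.accept()
    # SO_LINGER with a zero timeout makes close() send RST instead of FIN.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()

    error_number = None
    try:
        client.recv(1)
    except socket.error as exc:
        error_number = exc.errno
    finally:
        client.close()
        server.close()
    return server_port, client_port, error_number
```

On Linux the returned error number equals `errno.ECONNRESET`, which is 104, the same number as in the ‘Connection reset by peer’ messages above.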

If you check your logs on the host, is there any message saying “No HTTPS/OpenSSL support.”? I’m not expecting you to see this, as the server should shut down in that case, and I don’t think that is happening.

Mel


Hi again, is there any documentation on how the SSL connection between cylc host and HPC host is supposed to work? What are the steps to follow to establish an SSL connection? How can one check that these steps have been implemented?

Cylc doesn’t implement any handshake logic; this is all handled by the standard Python libraries Cylc uses.

By preference Cylc uses the requests library if installed, else it falls back on urllib2.

Here is the requests library documentation on SSL certificate validation:

https://2.python-requests.org/en/master/user/advanced/#ssl-cert-verification

For reference, here is the Cylc client call to the requests library:

And here is the alternative urllib2 call:

Cylc uses pyOpenSSL to generate the certificates, the logic that creates them can be found here:

Before digging further into Cylc it may be worth trying to form connections using the requests library in Python2.
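
A connection test along those lines might look like the sketch below. It is not Cylc's client code: it only mirrors the "requests if installed, else urllib2" preference described above, the function name `try_https` is hypothetical, and certificate verification is disabled on the assumption that the suite certificate is self-signed, so it should be used for diagnosis only. The imports are written to work under both Python 2 and Python 3.

```python
import ssl

try:
    import requests
except ImportError:
    requests = None

try:
    from urllib.request import urlopen      # Python 3
    from urllib.error import HTTPError
except ImportError:
    from urllib2 import urlopen, HTTPError  # Python 2


def try_https(url, timeout=10):
    """Attempt one HTTPS GET and describe the outcome as a string."""
    if requests is not None:
        try:
            response = requests.get(url, verify=False, timeout=timeout)
            return "requests: HTTP %d" % response.status_code
        except requests.exceptions.RequestException as exc:
            return "requests failed: %r" % (exc,)

    # Fallback: standard library with certificate checks switched off.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        response = urlopen(url, timeout=timeout, context=context)
        return "urllib: HTTP %d" % response.getcode()
    except HTTPError as exc:
        # The server answered over TLS, just with an error status (e.g. 401).
        return "urllib: HTTP %d" % exc.code
    except Exception as exc:
        return "urllib failed: %r" % (exc,)
```

Any result that reports an HTTP status, even 401, would mean the TLS handshake itself completed; a "failed" result with a connection reset would point back at the server side of the handshake.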
