Dear all,
I’m installing cylc/rose on a new system which has CentOS8 on both cylc host and HPC host. I cannot get the two to talk to each other via the default “https” method.
This is my system-wide cylc config, /opt/cylc/cylc-flow-7.9.1/etc/global.rc, on the cylc host:
[communication]
    method = https
[hosts]
    [[localhost]]
        copyable environment variables = FCM_VERSION, ROSE_VERSION
    [[cl*-master*]] # this should be name of compute cluster
        copyable environment variables = FCM_VERSION, ROSE_VERSION
        retrieve job logs = True
        retrieve job logs max size = 32M
        retrieve job logs retry delays = PT10S, PT30S, PT3M
        task communication method = ssh
My system-wide rose config in /opt/rose/rose-2019.01.3/etc/rose.conf:
[rose-host-select]
default=cl3-master
This warning during suite startup may have something to do with it?
[WARN] CherryPy Checker:
[WARN] The use of ‘localhost’ as a socket host can cause problems on newer systems, since ‘localhost’ can map to either an IPv4 or an IPv6 address. You should use ‘127.0.0.1’ or ‘[::1]’ instead.
But where do I replace localhost with the loopback address?
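One way to see what the CherryPy warning is referring to is to ask the resolver what `localhost` maps to on each host. The following is only a rough sketch, assuming a Python 3 interpreter is available alongside the Python 2 that cylc 7 uses; the function name `resolve` and the default port 43015 (taken from the error messages in this post) are illustrative choices, not part of cylc.

```python
import socket


def resolve(host, port=43015):
    """Return the distinct (family label, address) pairs the resolver gives for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    seen = []
    for family, _type, _proto, _canon, sockaddr in infos:
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        entry = (label, sockaddr[0])
        if entry not in seen:
            seen.append(entry)
    return seen


if __name__ == "__main__":
    # Print every address that "localhost" can map to on this machine.
    for label, addr in resolve("localhost"):
        print(label, addr)
```

If this prints both an IPv4 and an IPv6 entry, a client and a server on the same machine could in principle pick different ones, which is the situation the warning describes.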
The suite starts up, but the gcontrol doesn’t get any feedback from the HPC host - the messages don’t make their way back to the cylc host.
The first thing I checked was that password ssh is enabled in both directions.
Then I checked firewalld on the cylc host and enabled the port range 43000-43100:
sudo firewall-cmd --permanent --zone=public --add-port=43000-43100/tcp
I can now connect from the HPC host back to the cylc host on any port in that range.
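The connectivity check described above could be scripted roughly as follows, to be run from the HPC host. This is a sketch under assumptions: it needs Python 3, `probe` is a made-up helper name, and it only tests plain TCP reachability, not HTTPS. Note that a port only reports "open" while something is actually listening on it; an unblocked but unused port shows up as "refused".

```python
import socket


def probe(host, port, timeout=2.0):
    """Classify a single TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host was reached, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. packets silently dropped on the way.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "No route to host" from a rejecting firewall.
        return "error: %s" % exc
```

For example, `{p: probe("cylc-host.example", p) for p in range(43000, 43101)}` would summarise the whole range, where `cylc-host.example` is a placeholder for the real cylc host name.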
But when the suite starts (rose suite-run), the gcontrol pops up and says not connected. I forced an update using Control > Poll All ... and after a timeout I see this error message:
Cannot connect: https://localhost:43015/poll_tasks: (‘Connection aborted.’, error(104, ‘Connection reset by peer’))
I then checked the local logs on the cylc host to see if any task had been launched in the background. A directory had already been created for the usual cold start tasks, and in the local log at ~/cylc-run/SUITE/log/job/20200922T0000Z/TASK/NN/job.err I found this:
2020-09-23T12:16:50Z WARNING - Message send failed, try 1 of 7: Cannot connect: https://localhost:43015/put_messages: (‘Connection aborted.’, error(104, ‘Connection reset by peer’))
It attempts that connection 7 times and then exits with an error.
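Since "Connection reset by peer" can come from either the TCP connection or the TLS handshake, a small test that separates the two steps might help narrow this down. The sketch below is not cylc code: `check_https_endpoint` is a hypothetical helper, it assumes Python 3, and it turns certificate verification off on the assumption that the suite server uses a self-signed certificate.

```python
import socket
import ssl


def check_https_endpoint(host, port, timeout=5.0):
    """Report whether the TCP connect or the TLS handshake is the failing step."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return "tcp connect failed: %s" % exc
    # Assumption: the server certificate is self-signed, so verification is
    # disabled here. This is for diagnosis only, not for real traffic.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return "tls ok: %s" % tls.version()
    except (ssl.SSLError, OSError) as exc:
        return "tls handshake failed: %s" % exc
    finally:
        sock.close()
```

Running `check_https_endpoint("localhost", 43015)` on the cylc host while the suite is up would show which stage fails. A handshake failure from a modern Python 3 client may also just mean the server offers a TLS version or cipher that the client refuses, so the result is a hint rather than a diagnosis.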
Then I checked the HPC and the cold start tasks were actually running, so the job submission had been successful; the HPC just couldn’t talk back to the cylc host to keep the gcontrol updated.
A similar error message to the above was displayed when trying to kill the suite with rose shut-down:
Cannot connect: https://localhost:43015/set_stop_cleanly?kill_active_tasks=False: (‘Connection aborted.’, error(104, ‘Connection reset by peer’))
The shut-down doesn’t work - the tasks keep running on the HPC host, and I am left with a half-dead suite on the cylc host, which I have to clean up manually (delete “~/cylc-run/SUITE/.service/contact” and kill the pid of the “python2 /opt/cylc/cylc-flow-7.9.1/bin/cylc-run” process).
So how can I debug why the task messaging is failing?
Thanks!