Cylc Hub disconnection issues

Hi,

Following on from https://cylc.discourse.group/t/cylc-hub-issues/1126 we’re still not quite there with a working multi-user cylc hub setup.

We are seeing the Cylc UI disconnect from workflows every 52s (the red banner). We are struggling to work out where this disconnection is happening and what is causing it. Has anyone else encountered this?

We are running JupyterHub on our PUMA2 server with an Apache webserver, sitting behind the ARCHER2 NGINX reverse web proxy. We have tried upping the NGINX and Apache proxy timeouts to no avail. The folks who manage the reverse proxy have told me they cannot see any flows being dropped due to out-of-state errors between the proxy and PUMA2.

In the PUMA2 Apache access logs I’m seeing GET requests every 52s:

10.22.10.2 - - [30/Apr/2025:13:38:40 +0100] "GET /user/ros/cylc/subscriptions HTTP/1.1" 200 - "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36"
10.22.10.2 - - [30/Apr/2025:13:39:32 +0100] "GET /user/ros/cylc/subscriptions HTTP/1.1" 200 - "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36"
10.22.10.2 - - [30/Apr/2025:13:40:24 +0100] "GET /user/ros/cylc/subscriptions HTTP/1.1" 200 - "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36"
10.22.10.2 - - [30/Apr/2025:13:41:16 +0100] "GET /user/ros/cylc/subscriptions HTTP/1.1" 200 - "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36"

And in the system log, when accessing (for example) a log file via the Cylc UI, there are Tornado errors confirming that the websocket has been closed:

Apr 30 12:51:26 puma2 cylc[4072843]: Task exception was never retrieved
Apr 30 12:51:26 puma2 cylc[4072843]: future: <Task finished name='Task-344653' coro=<TornadoSubscriptionServer.on_start() done, defined at /home/n02/n02/fcm/metomi/cylc-8.4.1-1/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py:129> exception=WebSocketClosedError()>
Apr 30 12:51:26 puma2 cylc[4072843]: Traceback (most recent call last):
Apr 30 12:51:26 puma2 cylc[4072843]:  File "/home/n02/n02/fcm/metomi/cylc-8.4.1-1/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 148, in on_start
Apr 30 12:51:26 puma2 cylc[4072843]:    await self.send_execution_result(
Apr 30 12:51:26 puma2 cylc[4072843]:  File "/home/n02/n02/fcm/metomi/cylc-8.4.1-1/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 173, in send_execution_result
Apr 30 12:51:26 puma2 cylc[4072843]:    await BaseSubscriptionServer.send_execution_result(
Apr 30 12:51:26 puma2 cylc[4072843]:  File "/home/n02/n02/fcm/metomi/cylc-8.4.1-1/lib/python3.9/site-packages/graphql_ws/base_async.py", line 181, in send_message
Apr 30 12:51:26 puma2 cylc[4072843]:    return await connection_context.send(message)
Apr 30 12:51:26 puma2 cylc[4072843]:  File "/home/n02/n02/fcm/metomi/cylc-8.4.1-1/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 49, in send
Apr 30 12:51:26 puma2 cylc[4072843]:    await self.ws.write_message(data)
Apr 30 12:51:26 puma2 cylc[4072843]:  File "/home/n02/n02/fcm/metomi/cylc-8.4.1-1/lib/python3.9/site-packages/tornado/websocket.py", line 332, in write_message
Apr 30 12:51:26 puma2 cylc[4072843]:    raise WebSocketClosedError()
Apr 30 12:51:26 puma2 cylc[4072843]: tornado.websocket.WebSocketClosedError
Apr 30 12:51:26 puma2 cylc[4072843]: During handling of the above exception, another exception occurred:
Apr 30 12:51:26 puma2 cylc[4072843]: Traceback (most recent call last):
Apr 30 12:51:26 puma2 cylc[4072843]:  File "/home/n02/n02/fcm/metomi/cylc-8.4.1-1/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 151, in on_start
Apr 30 12:51:26 puma2 cylc[4072843]:    await self.send_error(connection_context, op_id, e)
Apr 30 12:51:26 puma2 cylc[4072843]:  File "/home/n02/n02/fcm/metomi/cylc-8.4.1-1/lib/python3.9/site-packages/graphql_ws/base_async.py", line 181, in send_message
Apr 30 12:51:26 puma2 cylc[4072843]:    return await connection_context.send(message)
Apr 30 12:51:26 puma2 cylc[4072843]:  File "/home/n02/n02/fcm/metomi/cylc-8.4.1-1/lib/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 49, in send
Apr 30 12:51:26 puma2 cylc[4072843]:    await self.ws.write_message(data)
Apr 30 12:51:26 puma2 cylc[4072843]:  File "/home/n02/n02/fcm/metomi/cylc-8.4.1-1/lib/python3.9/site-packages/tornado/websocket.py", line 332, in write_message
Apr 30 12:51:26 puma2 cylc[4072843]:    raise WebSocketClosedError()
Apr 30 12:51:26 puma2 cylc[4072843]: tornado.websocket.WebSocketClosedError

PUMA2 Apache setup:

<VirtualHost *:80>
    ServerName XXXXXXX

    ProxyPreserveHost On

    # Use RewriteEngine to handle WebSocket connection upgrades
    RewriteEngine On
    RewriteCond %{HTTP:Connection} Upgrade [NC]
    RewriteCond %{HTTP:Upgrade} websocket [NC]
    RewriteRule /(.*) ws://localhost:8000/$1 [P,L]

    # HTTP proxy to JupyterHub
    ProxyPass "/" "http://localhost:8000/"
    ProxyPassReverse "/" "http://localhost:8000/"

    RequestHeader     set "X-Forwarded-Proto" expr=%{REQUEST_SCHEME}

    # Long timeouts
    Timeout 3600
    ProxyTimeout 3600
    ProxyWebsocketIdleTimeout 3600
</VirtualHost>

ARCHER2 NGINX reverse proxy setup:

server {

    listen              80;
    server_name         XXXXXXXXX;

    location / {

        proxy_set_header HOST $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_pass http://XXXXXXXX;
        proxy_buffering off;
        proxy_http_version 1.1;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
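        # forward the websocket upgrade headers to the upstream server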
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $http_connection;
     }
}

I’ve been told they have raised the proxy_read_timeout and proxy_send_timeout to 300 seconds; I don’t have access to the reverse proxy to check exactly what has been done.

Any ideas on how to track down where the problem is occurring, or what the issue may be, would be gratefully received. It’s so annoying as we are so very nearly there, but the frequent disconnects cause the Cylc hub to spawn new cat-log & tail processes each time, which brings PUMA2 to a grinding halt within hours. :unamused_face:

Has anyone else managed to successfully set up Cylc Hub behind a reverse web proxy?

Our Cylc versions are:

Cylc 8.4.1-1
Cylc UI 2.7.0
Cylc Hub 5.2.1

(@Rosalyn_Hatcher - sorry I can’t help with this yet; at my site we’ve postponed hub deployment until the new HPC, which is just becoming available now … maybe the MO team can advise).

Yes! This is almost certainly being caused by a proxy (note, there may be more than one involved).

Some web proxies implement a default timeout for idle websocket connections. The Cylc UI Server uses websockets to communicate with browser sessions. Data is only sent down the websocket when something changes, so during a short period of inactivity the connection may appear idle to the proxy, which then kills it.

Unfortunately, I’m not familiar with NGINX configuration. ProxyWebsocketIdleTimeout sounds like it should do the job on the Apache side, but perhaps the NGINX timeouts need to be squared with Apache’s too? JupyterHub provides some advice for configuring NGINX; however, it doesn’t look like that covers websocket timeouts.


An alternative to adjusting proxy timeouts is to get the server to send regular heartbeat pings to the client every so many seconds in order to keep the connection active. Unfortunately, the Tornado webserver that Jupyter uses has a broken implementation for detecting websocket ping timeouts (it causes false positives), which prevents us from using this feature at the moment. We have contributed a fix to Tornado which should arrive in Tornado 6.5.0, and we will configure a default ping interval in the next Cylc UI Server release once this is available. I can’t provide a time estimate for the next Tornado release, but I would hope it appears in the next few months.

In the meantime, you could try configuring a ping interval with a very long timeout, e.g.:

# <cylc-config-root>/uiserver/jupyter_config.py

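# send regular server-to-client pings (every 400 seconds here) so the
# connection never looks idle to any proxy in between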
c.ServerApp.websocket_ping_interval = 400
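# very long ping timeout to sidestep the Tornado false-positive
# disconnects mentioned above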
c.ServerApp.websocket_ping_timeout = 999999

This will result in false positives on web app startup, but should work ok thereafter.


the frequent disconnects cause the cylc hub to spawn new cat-log & tail processes each time which brings PUMA2 to a grinding halt within hours.

We have a partial solution to this problem in the form of a cylc cat-log timeout; see UI Server Configuration — Cylc 8.4.2 documentation.
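The setting lives in the same jupyter_config.py as the ping settings above; a minimal sketch (the value is illustrative and I’m assuming it is interpreted as seconds - check the linked documentation for the exact semantics):

# <cylc-config-root>/uiserver/jupyter_config.py

# cat-log timeout: limits how long the cat-log/tail subprocesses behind
# a log view can run (illustrative value, assumed to be in seconds)
c.CylcUIServer.log_timeout = 600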

Hi Oliver,

EPCC took another look at their end and it turns out they have HAProxy sitting in front of their NGINX reverse proxy, and it was the timeout on the HAProxy server that was causing the issues. They’ve extended it a bit, but not by much at present.

As per your suggestion, I have configured a heartbeat ping interval just under the HAProxy timeout and that seems to be doing the trick; no disconnections yesterday. :tada: So I think we now have a working solution for this. :grin:

I’ve also configured c.CylcUIServer.log_timeout as suggested.

In a bid to simplify our setup, I’ve taken Apache out of the equation as well.
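For anyone else setting this up, the relevant bits of our jupyter_config.py now look roughly like this (the values shown are illustrative rather than our exact numbers; the ping interval just needs to sit below the shortest proxy idle timeout in the chain, HAProxy in our case):

# <cylc-config-root>/uiserver/jupyter_config.py

# heartbeat ping just under the shortest proxy idle timeout (HAProxy here)
c.ServerApp.websocket_ping_interval = 240
# keep the ping timeout very long, as per Oliver's suggestion above
c.ServerApp.websocket_ping_timeout = 999999

# clean up cat-log/tail subprocesses left behind by dropped sessions
# (illustrative value, assumed to be in seconds)
c.CylcUIServer.log_timeout = 600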

Thanks for your help with this.

Cheers,
Ros.


Great, I’m glad that’s sorted.

These timeouts have caused issues for a couple of deployments; it can be a right pain working out which proxy is causing the problem. Once the next Tornado release is out, we’ll configure a default heartbeat ping in the hope of avoiding the need to mess with proxy configurations.

Cheers,
Oliver