Disconnects in Cylc UI logs

All,
I don’t want to conflate this issue with the other one we are having involving slow workflow loads.

Another issue we are seeing with our multi-user hub (Cylc 8.2.3, UI Server 1.4) happens when a user clicks on anything in the GUI. On occasion the GUI will disconnect from workflows (the red banner), and other users on that same Cylc server will see themselves disconnected too. From what I am hearing, recovery is luck of the draw: sometimes you get reconnected, sometimes not. I am not entirely sure how this manifests in the log, but I have gathered an instance containing something that might look out of place. I don’t want to claim the two issues are related, but they do share disconnects as a symptom. The log segment below suggests something might be up with the proxy connection we traverse, but I am no expert on the Hub UI. Does anyone in the Cylc user group have experience using Cylc via forwarded web proxy connections? Does the disconnect below point to anything in particular?

2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]: future: <Task finished name='Task-16126' coro=<TornadoSubscriptionServer.on_start() done, defined at /autofs/nccs-svm1_afw_sw/afw-system/cylc/8.2.3/lib64/python3.9/site-packages/cylc/uiserver/websockets/tornado.py:127> exception=WebSocketClosedError()>
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]: Traceback (most recent call last):
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:   File "/autofs/storageXX/cylc/8.2.3/lib64/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 146, in on_start
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:     await self.send_execution_result(
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:   File "/autofs/storageXX/cylc/8.2.3/lib64/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 166, in send_execution_result
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:     await super().send_execution_result(connection_context, op_id, execution_result)
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:   File "/autofs/storageXX/cylc/8.2.3/lib64/python3.9/site-packages/graphql_ws/base_async.py", line 189, in send_execution_result
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:     await super().send_execution_result(connection_context, op_id, execution_result)
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:   File "/autofs/storageXX/cylc/8.2.3/lib64/python3.9/site-packages/graphql_ws/base_async.py", line 181, in send_message
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:     return await connection_context.send(message)
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:   File "/autofs/storageXX/cylc/8.2.3/lib64/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 47, in send
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:     await self.ws.write_message(data)
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:   File "/autofs/storageXX/cylc/8.2.3/lib64/python3.9/site-packages/tornado/websocket.py", line 331, in write_message
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:     raise WebSocketClosedError()
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]: tornado.websocket.WebSocketClosedError
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]: During handling of the above exception, another exception occurred:
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]: Traceback (most recent call last):
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:   File "/autofs/storageXX/cylc/8.2.3/lib64/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 149, in on_start
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:     await self.send_error(connection_context, op_id, e)
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:   File "/autofs/storageXX/cylc/8.2.3/lib64/python3.9/site-packages/graphql_ws/base_async.py", line 181, in send_message
2023-11-21T01:09:52.198927+00:00 cylcXX cylc-hub.sh[176421]:     return await connection_context.send(message)
2023-11-21T01:09:52.199959+00:00 cylcXX cylc-hub.sh[176421]:   File "/autofs/storageXX/cylc/8.2.3/lib64/python3.9/site-packages/cylc/uiserver/websockets/tornado.py", line 47, in send
2023-11-21T01:09:52.199959+00:00 cylcXX cylc-hub.sh[176421]:     await self.ws.write_message(data)
2023-11-21T01:09:52.199959+00:00 cylcXX cylc-hub.sh[176421]:   File "/autofs/storageXX/cylc/8.2.3/lib64/python3.9/site-packages/tornado/websocket.py", line 331, in write_message
2023-11-21T01:09:52.199959+00:00 cylcXX cylc-hub.sh[176421]:     raise WebSocketClosedError()
2023-11-21T01:09:52.199959+00:00 cylcXX cylc-hub.sh[176421]: tornado.websocket.WebSocketClosedError
2023-11-21T01:10:00.765346+00:00 cylcXX cylc-hub.sh[168504]: [D 2023-11-21T01:10:00.765 JupyterHub proxy:880] Proxy: Fetching GET http://127.0.0.1:8001/api/routes--

We haven’t had any reports of this problem before, sorry, as far as I’m aware.

Our main sites are still using Cylc 8 in single user (hub-less) mode, although that will change before too long. From your traceback, the error is happening inside cylc-uiserver, but I suppose it could possibly be caused by a flaky web proxy (via the hub).

On occasion the GUI will disconnect from workflows (the red banner), and other users on that same Cylc server will see themselves disconnected too.

By “the same Cylc server” I presume you mean the same Cylc UI Server - i.e., you have multiple users looking at the same target user’s workflows?

Correct,
We can have as many as six folks using the UI Server hub at any given time. Some of the “Production” workflows will have more than one user looking at them.

OK. Well, to state the obvious, I suppose we need to figure out why the connection is getting dropped, and whether we’re doing the right thing to reconnect if/when it does get dropped.

You said this happens “on occasion” - how frequently is that? (Hopefully infrequent!)

Unfortunately I am sending this on behalf of the folks who are actually operating the workflows, so it is coming second-hand, but from the feedback I am receiving it happens fairly frequently.

Just to rule things out, does the problem occur if you run hub-less, i.e., directly connecting to the UI Server via $ cylc gui?

Since we have many people using Cylc for different workflows (multi-user), we chose to deploy JupyterHub with the UI Server to fit our needs. It’s not required, but having a web browser as the client GUI is preferred, as we have had less-than-stellar success trying to use X11 forwarding. So we have tried running hub-less, but it wasn’t great.

@dhladky - that is absolutely what the hub is for, and we’re all going there (it’s just timing - some sites, at the moment, still have less of an imperative to migrate from Cylc 7 quickly).

However, note that hub-less “single-user” mode does not require X11 forwarding (which can indeed be horrible). It’s simply that you’re not using the central hub to start UI Servers, and there is no web proxy in the middle.

The cylc gui command starts a UI Server and prints a URL to the terminal, which the user (given the right network access, of course) can connect to from their local browser.
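For example, something along these lines (the workflow ID is just a placeholder, and the exact output varies by version):

    # start a personal UI Server and print its URL
    $ cylc gui
    # or open the GUI at a particular workflow
    $ cylc gui <workflow-id>

Opening the printed URL in a local browser is all that’s needed, network access permitting.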

@D_Sutherland has done some initial investigation, and at this stage it looks (partly from your traceback above) as if the dropped connections are not caused by Cylc code, so it might have to be debugged at your end.

However, that’s a tentative conclusion at best. And even if the root cause is not inside Cylc, Cylc might still be able to handle it better. Next week I’ll try to do some load testing with big workflows and the hub on our HPC, as opposed to my usual development box.

Can you give some idea of the size and complexity of your workflows? Send a direct message to me if you like (Discourse allows that).

Note that to see how Cylc itself behaves, in terms of workflow evolution or UI performance etc., it can be easier to run real workflows in dummy mode (which replaces the real jobs with local dummy jobs) or simulation mode (which does not submit real jobs at all). To Cylc, it doesn’t matter what the jobs do, so long as the right outputs are generated.
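For reference, a minimal sketch of starting a workflow in those modes, assuming the Cylc 8 cylc play --mode option (the workflow ID is a placeholder):

    # replace the real jobs with local dummy jobs
    $ cylc play --mode=dummy <workflow-id>

    # don't submit any jobs at all; Cylc simulates them internally
    $ cylc play --mode=simulation <workflow-id>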

For more nuanced dummy testing you can do things like comment out all script items, put (e.g.) script = sleep 10 in the root family, and override that with script = false to cause particular tasks to fail, or with scripting that generates certain custom outputs (see the sketch below).
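A minimal sketch of that kind of setup in flow.cylc (the task names here are made up for illustration):

    [runtime]
        [[root]]
            # cheap stand-in for every real job
            script = sleep 10
        [[a_task_we_want_to_fail]]
            # override the root default so this task fails
            script = false
        [[a_task_with_a_custom_output]]
            # emit a custom output message instead of doing real work
            script = cylc message -- "file ready"
            [[[outputs]]]
                ready = "file ready"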

Also, to share a workflow with others for this kind of purpose you can just delete all the script config items (and any sensitive content).