I have 7 workflows running on a shared Linux machine and I use the UI Server to monitor the jobs. Recently, however, it takes forever to load, and in the log I see repetitive "ClientTimeout" errors. The error mentions a --comms-timeout option, but I don't understand where to set it. I'm also not sure which log to check, as suggested by the error message: for the "dist-cem" workflow (see below) there is no error in the "log/scheduler/log" file, and from that log the workflow seems to run smoothly.
Log example:
[I 2024-08-05 10:36:51.021 CylcUIServer] [data-store] disconnect_workflow('~prod2/dist-cem/run1')
[I 2024-08-05 10:36:51.082 CylcUIServer] [data-store] connect_workflow('~prod2/dist-cem/run1', )
[E 2024-08-05 10:36:56.653 CylcUIServer] ClientTimeout: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[E 2024-08-05 10:36:56.655 CylcUIServer] Failed to update entire local data-store of a workflow: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[I 2024-08-05 10:36:56.656 CylcUIServer] failed to connect to ~prod2/dist-cem/run1
[I 2024-08-05 10:36:56.656 CylcUIServer] [data-store] disconnect_workflow('~prod2/dist-cem/run1')
[I 2024-08-05 10:36:56.779 CylcUIServer] [data-store] connect_workflow('~prod2/dist-cem/run1', )
[E 2024-08-05 10:37:02.219 CylcUIServer] ClientTimeout: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[E 2024-08-05 10:37:02.220 CylcUIServer] Failed to update entire local data-store of a workflow: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[I 2024-08-05 10:37:02.220 CylcUIServer] failed to connect to ~prod2/dist-cem/run1
[I 2024-08-05 10:37:02.221 CylcUIServer] [data-store] disconnect_workflow('~prod2/dist-cem/run1')
[I 2024-08-05 10:37:02.314 CylcUIServer] [data-store] connect_workflow('~prod2/dist-cem/run1', )
[E 2024-08-05 10:37:07.764 CylcUIServer] ClientTimeout: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[E 2024-08-05 10:37:07.765 CylcUIServer] Failed to update entire local data-store of a workflow: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[I 2024-08-05 10:37:07.766 CylcUIServer] failed to connect to ~prod2/dist-cem/run1
[I 2024-08-05 10:37:07.766 CylcUIServer] [data-store] disconnect_workflow('~prod2/dist-cem/run1')
[I 2024-08-05 10:37:07.851 CylcUIServer] [data-store] connect_workflow('~prod2/dist-cem/run1', )
[E 2024-08-05 10:37:13.301 CylcUIServer] ClientTimeout: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[E 2024-08-05 10:37:13.304 CylcUIServer] Failed to update entire local data-store of a workflow: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[I 2024-08-05 10:37:13.304 CylcUIServer] failed to connect to ~prod2/dist-cem/run1
[I 2024-08-05 10:37:13.305 CylcUIServer] [data-store] disconnect_workflow('~prod2/dist-cem/run1')
[I 2024-08-05 10:37:13.367 CylcUIServer] [data-store] connect_workflow('~prod2/dist-cem/run1', )
[E 2024-08-05 10:37:18.941 CylcUIServer] ClientTimeout: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[E 2024-08-05 10:37:18.945 CylcUIServer] Failed to update entire local data-store of a workflow: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[I 2024-08-05 10:37:18.945 CylcUIServer] failed to connect to ~prod2/dist-cem/run1
[I 2024-08-05 10:37:18.945 CylcUIServer] [data-store] disconnect_workflow('~prod2/dist-cem/run1')
[I 2024-08-05 10:37:19.019 CylcUIServer] [data-store] connect_workflow('~prod2/dist-cem/run1', )
[E 2024-08-05 10:37:24.622 CylcUIServer] ClientTimeout: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[E 2024-08-05 10:37:24.625 CylcUIServer] Failed to update entire local data-store of a workflow: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.
[I 2024-08-05 10:37:24.625 CylcUIServer] failed to connect to ~prod2/dist-cem/run1
[I 2024-08-05 10:37:24.625 CylcUIServer] [data-store] disconnect_workflow('~prod2/dist-cem/run1')
Opening the UI Server in the morning has always been slow, but these repetitive errors seem to have started when I upgraded from 8.2.3 to 8.3.3.
The machine does not look loaded at all. I am on the same LAN, so I don't expect network problems, but the machine is shared with about ten other people and we have seen slow I/O in the past.
We haven't seen client timeouts at the UI Server before; I'll take a look to see how this could occur.
The comms timeout determines how long a Cylc client will spend waiting for a response from the scheduler that is running a workflow. The default timeout is 5 seconds, which should be sufficient for most purposes. There is no central configuration for increasing it because it shouldn't normally need to be changed.
If a workflow is taking more than 5 seconds to respond, then:
1. The workflow is absolutely massive (we haven't seen this impact the UI Server yet).
   Try monitoring the workflow with cylc subscribe <workflow-id> -T all to see whether this is the problem.
2. Something is fishy with the network.
   This is hard to diagnose; try a simple connection from where the UI Server is running to where the scheduler is running.
3. The connections are actually failing, but are being reported as timeouts (because it can be hard to tell the difference).
   Try running cylc ping <workflow-id> --comms-timeout=60. If this works, the connection is OK; reduce the timeout until it times out (see the example commands below).
Worth making sure you are running with the latest Cylc UI Server version.
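As noted above, there is no config-file setting for this: --comms-timeout is an option you pass directly to Cylc client commands. So the checks above look roughly like the following (using your workflow ID as an example; the timeout values are just a starting point):

cylc ping dist-cem/run1 --comms-timeout=60   # generous timeout: does the connection work at all?
cylc ping dist-cem/run1 --comms-timeout=10
cylc ping dist-cem/run1 --comms-timeout=5    # the default; a failure here reproduces what the UI Server sees
cylc subscribe dist-cem/run1 -T all          # watch the data the scheduler is publishing

If the 60-second ping succeeds but the 5-second one fails, the scheduler is reachable but slow to respond, which would point at scheduler load rather than the network.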
[prod2@neree ~]$ cylc subscribe dist-cem -T all
dist-cem is not running
dist-cem is not running
dist-cem is not running
^C
[prod2@neree ~]$ cylc play dist-cem
Resuming already-running workflow
workflow contact file exists: /home/prod2/cylc-run/dist-cem/run1/.service/contact
Workflow "dist-cem/run1" is already running, listening at "<redacted>:43014".
To start a new run, stop the old one first with one or more of these:
* cylc stop dist-cem/run1 # wait for active tasks/event handlers
* cylc stop --kill dist-cem/run1 # kill active tasks and wait
* cylc stop --now dist-cem/run1 # don't wait for active tasks
* cylc stop --now --now dist-cem/run1 # don't wait
* ssh -n "<redacted>" kill 49897 # final brute force!
And no, for "dist-cem", the UI never shows anything. I also realized I must have messed up the "runahead" limit, because I currently have 50 years of monthly tasks running (as seen with cylc workflow-state). The only task actually running is a file-existence check that repeats every hour and currently fails (the model has not simulated that far yet).
I think I will stop, clean, and restart the workflow, making sure the runahead limit is properly set.
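Roughly, as a sketch (the exact clean/install steps will depend on how the workflow was installed; I reinstall from my source directory):

cylc stop --now dist-cem/run1   # stop the running scheduler
cylc clean dist-cem             # remove the old run directory
cylc install                    # run from the workflow source directory to reinstall
cylc play dist-cem              # start a fresh run with the corrected runahead limit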
Other workflows do load in the UI, but it takes time (> 30 s). For a while (~2 min) this morning the Firefox tab was at 100% CPU and using over 2 GB of RAM! Although this might only be related to the runahead issue, which should affect 3 of the workflows.
You were right. Adding /run1 makes the command work.
After stopping, cleaning and restarting the workflows with a P1Y runahead limit (12 cycle points), everything seems to run smoothly.
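For reference, the setting I mean is the runahead limit in the [scheduling] section of flow.cylc; with monthly cycling, P1Y corresponds to the 12 cycle points mentioned above (the other values in this snippet are just illustrative):

[scheduling]
    initial cycle point = 1993          # illustrative value
    runahead limit = P1Y                # keep one year of monthly cycles active at a time
    [[graph]]
        P1M = "check_file => postproc"  # placeholder task names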
It does seem that, at least on my machine, Cylc didn't scale as expected? I'm not sure how to diagnose this loss of performance. With htop I saw that, all morning, CPU usage stayed under 500% (out of a theoretical max of 6400%) and memory under 30 GB out of 252 GB. Maybe this is an I/O issue? All the buggy workflows run on local disks, though. The run directory is a symlink to a different disk than home, but still local. Maybe network issues within the LAN? I'll ask the IT guy here whether this morning was abnormally loaded.
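For what it's worth, a quick way to sanity-check disk pressure while the UI is slow is with standard Linux tools (nothing Cylc-specific; iostat comes from the sysstat package):

iostat -x 5     # extended per-device stats every 5 s; watch utilisation and await times
vmstat 5        # watch the 'wa' (I/O wait) and 'b' (blocked processes) columns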
There are many dimensions for Cylc to scale in (e.g. number of cycles, number of tasks, rate of task events, number of views open in the GUI, n-window selection, etc.).
Pushing up the number of cycles can result in a lot of data transfer, especially if there is a large array of tasks immediately upstream or downstream of the active tasks in each cycle, or if the n-window number in the GUI has been increased.
It would be good for us to see the workflow configuration that caused this poor scaling, so that we can target performance optimisations for it in the future. If your workflow is open source, it would be great to see the configuration directly. Otherwise, a rough breakdown of the [scheduling] section would help: we don't need the [runtime] section or the real task names, just the general shape of the graph (see the sketch below for the level of detail that is useful).
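For example, something along these lines, with made-up task names and cycling values, would already tell us a lot about the shape:

[scheduling]
    initial cycle point = 1975
    final cycle point = 2025
    runahead limit = P1Y
    [[graph]]
        R1 = "install_cold => model"
        P1M = """
            model[-P1M] => model => postproc & archive
            check_file_exists => postproc
        """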
From what you've reported, I don't think the filesystem you symlink the workflow installation onto is likely to have been the cause. Note, though, that you can link different parts of the workflow installation into different locations to keep the Cylc files local - see symlink dirs.
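For example, in the global configuration (global.cylc) something along these lines keeps the run, log, work and share directories on a local disk; the target path here is just a placeholder:

[install]
    [[symlink dirs]]
        [[[localhost]]]
            # Cylc creates a cylc-run/ tree under each target and symlinks into it
            run = /local/fast-disk
            log = /local/fast-disk
            work = /local/fast-disk
            share = /local/fast-disk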
I suspect from what you've described that the Cylc scheduler was not able to calculate the list of cycles within 5 seconds (which is concerning), so the connection timed out, causing the UI Server to retry the request and producing a request-timeout-retry loop. This would have caused a 100% CPU load on the scheduler. By design, the scheduler would have continued to schedule tasks as normal.