Cylc UI server memory requirements

I’m running the Cylc hub and all user UI servers on one VM, with the actual Cylc workflows running on other VMs.

My Hub VM is x86_64, has 8 CPUs, 32GB of memory and 8GB of swap. It runs the hub, sudospawners and hubapps as user processes, as well as fairly normal OS stuff (including auditd, kauditd, fapolicyd).

We keep running out of swap and the VM system administrators are blaming python/Cylc. This is not entirely unreasonable: on equivalent servers where I’m running only the hub with no users, we never run out of swap.

Tasks: 328 total,   3 running, 325 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.6 us,  0.9 sy,  0.0 ni, 96.8 id,  0.5 wa,  0.2 hi,  0.2 si,  0.0 st
MiB Mem :  31928.6 total,    282.6 free,  30663.4 used,    982.6 buff/cache
MiB Swap:   8192.0 total,    413.4 free,   7778.6 used.    829.0 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                    
  15531 user1     20   0   10.5g   4.1g   8888 S   3.6  13.1 230:01.01 python                                                                                     
  23034 user2     20   0 6192940   4.4g   7296 S   3.0  14.0  53:24.01 python                                                                                     
   1430 root      16  -4   75128   1060    772 D   2.6   0.0 110:40.08 auditd                                                                                     
   2721 user3     20   0   11.4g   6.3g      0 S   1.7  20.3  86:52.76 python                                                                                     
   3451 user4     20   0 2777116 253700      0 S   1.3   0.8  53:02.96 python                                                                                     
  26112 user5     20   0 1792924 129116      0 S   1.3   0.4  21:53.31 python                                                                                     
   1472 fapolic+   6 -14  312252  41004   9128 S   0.7   0.1  33:40.86 fapolicyd                                                                                  
   2807 user6     20   0 2872276 444260      0 S   0.7   1.4  55:30.52 python                                                                                     
  11427 user7     20   0 1863620 189160   2152 S   0.7   0.6  17:25.85 python                                                                                     
 111515 user8     20   0 5940704   1.5g  10784 S   0.7   4.8  45:05.81 python                                                                                     
     69 root      20   0       0      0      0 R   0.3   0.0  13:47.13 kauditd
...

Is this to be expected? We had a smaller VM where users ran cylc gui, cylc gscan and workflows for Cylc 7 (6 CPUs, 20GB memory, 4GB swap) and never had the same problems with swap, but we also weren’t running a mini webserver for each user. Here’s an example snapshot from top for one of our Cylc 7 hosts.

Tasks: 370 total,   3 running, 366 sleeping,   0 stopped,   1 zombie
%Cpu(s): 24.0 us,  7.1 sy,  0.0 ni, 59.3 id,  9.0 wa,  0.0 hi,  0.6 si,  0.0 st
KiB Mem : 20393736 total,   718976 free,  7174120 used, 12500640 buff/cache
KiB Swap:  4194300 total,  4095228 free,    99072 used. 11847216 avail Mem

Has anyone done any load testing? Given that I would like to host ~70 users on this host, how much memory should I expect to need? If it makes a difference, assume an average of 20 Cylc workflows per user.

On a different host, I expect to have many users (possibly as many as 100) using the hub who don’t have any workflows of their own, looking at UI Servers owned by a smaller set of users (~10-15) (so 115 UI servers, but only 15 needing to generate GraphQL etc.). The smaller set of users will probably own at most 20 workflows each. I would welcome any suggestions on what level of provisioning I might need for that host.

Hi,

It’s very hard for us to quantify the memory requirements of a Cylc UI Server instance as these requirements scale with usage. The memory required to view one workflow may be orders of magnitude higher than that for another workflow. The number of browser tabs connected to the server is also a factor, as is the number of different workflows open in those tabs and the particular “views” (i.e. Cylc UI tabs) open on those workflows.

I have not yet seen a Cylc UI Server process using 1GB of memory (RSS), let alone 6.3GB. I would not expect memory usage to get this high. There are multiple things that could cause high memory usage, so some investigation will be needed to determine the cause.

The first thing to check is how long these servers have been running. At our site we have not yet deployed Jupyter Hub, so our users still spawn their Cylc UI Servers manually (cylc gui). This (combined with local policies) means that our UI Server processes are relatively short lived, whereas in a Jupyter Hub setup the servers will likely stay up for much longer periods (server patching regime permitting). So we might not yet have seen issues related to gradual memory accumulation at our site. You could experiment by getting “user3” to restart their server (navigate to the “hub” section of the Cylc dashboard page and press “Stop Server”). When the server restarts, the UI sessions should reconnect and resume their subscriptions. If the restarted server doesn’t return to its previous level of memory usage, then it’s likely a resource management problem, which gives us something to work with. There are various things we could do in the UI Server to help bring this under control.
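As a quick way to check server age alongside resident memory, something like the following snapshot may help. This is a minimal sketch only: it assumes psutil is installed on the hub VM, that the UI Servers show up simply as python processes (as in your top output), and the 0.5 GiB reporting threshold is purely illustrative.

```python
#!/usr/bin/env python3
"""Snapshot of python processes: owner, resident memory (RSS) and uptime."""
import time

import psutil  # assumption: psutil is available on the hub VM

now = time.time()
for proc in psutil.process_iter(
        ['pid', 'username', 'name', 'memory_info', 'create_time']):
    try:
        if proc.info['name'] != 'python':
            continue
        rss_gib = proc.info['memory_info'].rss / 2**30
        uptime_h = (now - proc.info['create_time']) / 3600
        if rss_gib > 0.5:  # only report the big ones (illustrative threshold)
            print(f"pid={proc.info['pid']:<7} user={proc.info['username']:<10} "
                  f"rss={rss_gib:5.1f} GiB  uptime={uptime_h:7.1f} h")
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
```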

If server “uptime” is a contributing factor then, as a workaround, you might want to look into using a Jupyter Hub service to cull servers on a schedule. They provide a service for this (for idle servers) which might be of interest (I haven’t looked into how they determine whether a server is “idle” or not): see jupyterhub/jupyterhub-idle-culler on GitHub, a JupyterHub service to cull idle servers and users. Note that culled servers will be restarted automatically whenever a browser tab attempts to contact them.
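For reference, registering the idle culler as a hub-managed service follows the pattern in its README. This is only a sketch: the role and service names are arbitrary, it assumes JupyterHub 2.x-style roles, and the timeout values are placeholders to tune for your site.

```python
# jupyterhub_config.py -- sketch based on the jupyterhub-idle-culler README;
# role/service names and timeout values below are placeholders to adapt.
import sys

c.JupyterHub.load_roles = [
    {
        "name": "jupyterhub-idle-culler-role",
        "scopes": [
            "list:users",
            "read:users:activity",
            "read:servers",
            "delete:servers",
        ],
        "services": ["jupyterhub-idle-culler-service"],
    }
]

c.JupyterHub.services = [
    {
        "name": "jupyterhub-idle-culler-service",
        "command": [
            sys.executable,
            "-m", "jupyterhub_idle_culler",
            "--timeout=14400",   # cull servers idle for more than 4 hours
            "--cull-every=600",  # run the check every 10 minutes
        ],
    }
]
```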

If it’s not “uptime” related, we might need to look into:

  • What workflows are running on accounts that are showing high UI Server memory usage?
    • Any commonality in workflows between user accounts with high UIS memory usage?
  • How many browser tabs does the user have open?
    • Does the memory usage go down when tabs are closed?
    • When the server is restarted, how does memory scale as new tabs are opened? (the logging sketch after this list may help here)
  • What views does the user have open?
    • I’m not expecting this to be a major factor, but worth checking.
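To help answer the tab-related questions above, one approach is to log a server’s RSS at regular intervals while tabs and views are opened and closed, then compare the log against the timestamps of those actions. A minimal sketch, again assuming psutil, with the UI Server’s PID passed on the command line:

```python
#!/usr/bin/env python3
"""Log a process's RSS periodically to correlate memory with UI activity."""
import sys
import time

import psutil  # assumption: psutil is available

pid = int(sys.argv[1])  # PID of the UI Server process to watch
interval = 30           # seconds between samples
proc = psutil.Process(pid)

while True:
    try:
        rss_mib = proc.memory_info().rss / 2**20
    except psutil.NoSuchProcess:
        print("process has exited")
        break
    print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} pid={pid} rss={rss_mib:.1f} MiB",
          flush=True)
    time.sleep(interval)
```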

Oliver


PS: Note, the “virtual” memory reported by tools like ps is the virtual address space, not the actual amount of memory used (RSS). Multi-process/threaded Python apps generally report high “virtual” memory due to the way processes/threads are created (via forks); this number is not a concern, as these processes/threads will never “occupy” that virtual space. However, it is worth noting that we may remove the need for threading in a near-future cylc-uiserver release, which should bring this virtual number down.
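To see the distinction for yourself on Linux, you can compare the VmSize (virtual address space), VmRSS (resident) and VmSwap fields in /proc/&lt;pid&gt;/status. A tiny sketch, where the PID argument is whichever UI Server you want to inspect:

```python
#!/usr/bin/env python3
"""Print virtual (VmSize) vs resident (VmRSS) vs swapped (VmSwap) memory for a PID."""
import sys

pid = sys.argv[1]  # PID of the process to inspect
with open(f"/proc/{pid}/status") as f:
    for line in f:
        # VmSize is the virtual address space; VmRSS is what actually sits in RAM;
        # VmSwap is how much of the process has been pushed out to swap.
        if line.startswith(("VmSize:", "VmRSS:", "VmSwap:")):
            print(line.rstrip())
```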

I’ll just echo @oliver.sanders’ statement that it’s difficult for the development team to characterize Cylc resource requirements for “real” workflows, given Cylc’s distributed nature, the extreme diversity of its use cases, and the (non-Cylc) resources needed to run real workflows (sometimes involving local tasks on the scheduler host). That’s part of the reason we tried to encourage early (several years back!) Cylc 8 engagement and testing by all stakeholders, but you know how that goes :slight_smile:

In principle (which matters, particularly in the medium and longer term) Cylc 8 should be much more efficient and scalable than Cylc 7, in all the ways that matter. But it’s possible that we haven’t shaken out a memory leak or two yet.

I did get a local report of a UI Server apparently using 4 GB a while back; it had been running for 4 weeks or more and was an older version. In the event of a slow memory leak, as Oliver noted, the temporary workaround is not too painful: just kill the servers after a while, and they get restarted on the fly as needed.

I can alleviate that concern! You’ll only have 10-15 UI Servers for all those users. The UI Servers, like the schedulers, run as the workflow owner-user.

(I’m not sure whether that efficiency claim carries over from the old desktop GUI to the web UI, but no one wanted a re-do of the old GUI, and on the plus side for the web UI the more efficient scheduler means there are typically far fewer extraneous tasks to display, and the incremental data subscriptions greatly reduce the network traffic and the load on the schedulers. Plus the web tech has many other advantages, of course.)

So we saw these values creeping up again:

Tasks: 334 total,   2 running, 332 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.1 us,  2.4 sy,  0.0 ni, 94.5 id,  0.5 wa,  0.2 hi,  0.2 si,  0.0 st
MiB Mem :  31928.6 total,   4334.8 free,  26499.4 used,   1094.4 buff/cache
MiB Swap:   8192.0 total,   3174.4 free,   5017.6 used.   4946.2 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                       
 208218 user1     20   0   11.6g   5.9g   8724 S   7.3  19.0 317:54.38 python                                                                                                                        
 207833 user2     20   0    9.9g   6.9g   8164 S   5.6  22.2  91:52.75 python                                                                                                                        
   1430 root      16  -4   75252    708    380 R   3.6   0.0 206:03.38 auditd                                                                                                                        
 208345 user3     20   0   13.1g   5.9g   8748 S   3.3  19.0 138:13.61 python

User1 has 30 workflows running (some are stalled), 2 paused and 38 stopped.
User3 has 12 workflows running (some are stalled), 1 paused and 19 stopped.

Restarting User1’s server took a while, but it did not return to a similar memory footprint even after all of their workflows reloaded:

Tasks: 328 total,   1 running, 327 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  1.4 sy,  0.0 ni, 95.8 id,  0.5 wa,  0.2 hi,  0.2 si,  0.0 st
MiB Mem :  31928.6 total,   9691.7 free,  21132.5 used,   1104.4 buff/cache
MiB Swap:   8192.0 total,   3978.9 free,   4213.1 used.  10313.6 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                      
   1430 root      16  -4   75252    708    380 D   3.6   0.0 206:22.53 auditd                       
 207912 user4     20   0 2904304 710892   7896 S   2.6   2.2  55:37.39 python                       
 461424 user1     20   0 5803608 541016  24132 S   2.6   1.7   0:12.84 python

Even after waiting a good while, user1’s usage hasn’t returned to the previous high.

Tasks: 320 total,   3 running, 317 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.6 us,  1.1 sy,  0.0 ni, 94.5 id,  0.5 wa,  0.2 hi,  0.2 si,  0.0 st
MiB Mem :  31928.6 total,   9494.7 free,  21288.7 used,   1145.2 buff/cache
MiB Swap:   8192.0 total,   4417.4 free,   3774.6 used.  10155.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                      
 207912 user4     20   0 3057640 898324   7896 S  17.9   2.7  56:50.19 python                       
 461424 user1     20   0 5827940 559088  24132 S   3.3   1.7   1:57.57 python  

I’ll investigate the idle-culler. Thanks for the suggestion.

Thanks for trying that out. I expect the UI Servers were accumulating data from previous user activity. There are some things we can do to reduce what we hold and to release data where possible.