Slow load of Cylc workflows, disconnects

Hello,
We are using Cylc 8.2 in a multi-user Hub configuration and are currently seeing an issue with the loading of workflows. It can sometimes take up to 2 minutes or so for our users to load a workflow. During that time we see disconnects from the Cylc server (red banner) for the user in question and for other users. Has this behavior been recorded or noticed by other users? Here is a sample of the cycling graph/workflow involved. Any advice or help would be appreciated.

[task parameters]
    member = {{ MEMBERS }}
    fcsthr = 0..{{ CUTOFF_1HOURLY }}, \
       {{ CUTOFF_1HOURLY + 3 }}..{{ CUTOFF_3HOURLY }}..3, \
           {{ CUTOFF_3HOURLY + 6 }}..{{ END_FCSTHR }}..6
  [[templates]]
    member = member%(member)03d
    fcsthr = _fcsthr%(fcsthr)03d

[scheduling]
  initial cycle point = {{ START_CYCLE }}
  final cycle point =  {{ END_CYCLE }}
  runahead limit = P3
  [[special tasks]]
    clock-expire = PREPROCESS(PT18H), MEMBERS(PT18H), HOUSEKEEP(PT18H)
  [[xtriggers]]
    start = wall_clock(offset=PT7H15M)
  [[graph]]
    T00,T06,T12,T18 = """
        @start & prune[-PT6H]:finish => prune & purge
        @start => sniffer:ready_<member,fcsthr> => <member,fcsthr>_process? => finish
        <member,fcsthr>_process:fail? => fault
      """

Hi @dhladky

I think you are describing a known issue, not with the Cylc UI but (as far as we can tell) with Firefox internals when establishing a wss (web socket) connection.

Once established, the wss connection persists so it’s only a problem at start-up - when the UI first connects and requests the “gscan” left side panel data to list your workflows.

Are you using Firefox? And if so, can you try another browser to compare?

The issue has recently been replicated on Chrome and Edge - see https://github.com/cylc/cylc-ui/issues/1200

To explain what’s going on here: in some situations it takes multiple attempts for the GUI to connect to the server which provides the data. Often refreshing the page makes the problem go away. The issue is not related to the number or complexity of the workflows you have installed or running and can be replicated with no workflows at all. The cause of the issue is unclear, but it does not appear to be due to a particular browser or operating system.

Hillary,
We are using Chrome in most cases. We’ve tried Firefox but found better results with Chrome.

Oliver,
I saw the mention of wss. Is that a contributing factor? If so, this may give us a clue about our web proxy setup. Any idea why it would cause other users to disconnect?

Technical information if interested…

The issue is in establishing the websocket connection when the GUI is loaded (either ws or wss depending on how you deploy cylc-uiserver). The connection fails with code 1006 before it is established. When the connection attempt fails, the GUI will retry indefinitely (with an increasing interval between attempts) until it connects successfully, which is why the issue sometimes resolves itself after a period of time.

The cylc-uiserver (backend) does not appear to receive any connection attempt, suggesting that the issue is browser-side, but web proxy issues aren’t out of the question. If you’re able to access the proxy logs, it would be good to know whether the websocket connection attempts appear in them.

You can view the failed connection attempts by opening the developer console in your web browser. Each failed connection attempt is logged at the error level. If you search for the error message your browser gives you, you’ll find a large number of reports of this issue in other systems, with a variety of suggested fixes; I haven’t got any of them to work so far.

We have seen the issue more often with Firefox than with other browsers, and more often on loaded systems than idle ones; however, the issue can be extremely intermittent, so this may be coincidence.

Thanks Oliver, yes, very interested in the technical aspects here. Our users have pinpointed this as a particularly annoying issue, as we use VDI and go through a proxy to get to our Cylc servers. I have access to the proxy logs. Would it help if we provided them (redacted, of course)?

I’m not too interested in seeing the full logs per se, but I would like to know whether the failed requests show up in the proxy logs at all. If they do, it would be helpful to know any status codes or errors associated with them.

The websocket endpoint is /subscriptions; it’s the only websocket endpoint we use, and there’s only one websocket connection per GUI instance.
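
If it helps with the proxy-log check, here is a rough sketch (Python, using the third-party websockets package) for generating websocket connection attempts against that endpoint from outside the browser, retrying with an increasing interval roughly the way the GUI does. The URL below is a placeholder - copy the real /subscriptions URL from the browser’s developer tools (Network tab, “WS” filter). Without the hub’s auth cookie the handshake will most likely be rejected, which is fine: the question is simply whether the attempts show up in the proxy logs at all.

#!/usr/bin/env python3
"""Generate websocket connection attempts to look for in the proxy logs."""
import asyncio
import sys

import websockets  # third-party: pip install websockets

# Placeholder URL - replace with the real /subscriptions URL taken from
# the browser developer tools.
URL = sys.argv[1] if len(sys.argv) > 1 else (
    "wss://hub.example.com/user/me/cylc/subscriptions"
)


async def probe(url, attempts=5):
    """Try to open the websocket a few times, backing off between tries."""
    delay = 1
    for attempt in range(1, attempts + 1):
        try:
            async with websockets.connect(url):
                print(f"attempt {attempt}: handshake succeeded")
                return
        except Exception as exc:
            # A 401/403 rejection here still proves the attempt reached the
            # proxy/server, which is what we want to see in the logs.
            print(f"attempt {attempt}: {exc!r}")
        await asyncio.sleep(delay)
        delay *= 2  # increase the interval, roughly like the GUI's retries


if __name__ == "__main__":
    asyncio.run(probe(URL))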

I would like to clear up some items. This is not an intermittent or “only on startup” issue. We have multiple workflows and this one in particular always takes a significant time to load (45 seconds). We’ve seen other workflows with similar issues. The longest I’ve seen is two minutes. If you let it load, then select a new workflow, then come back to this one, it’s going to be another 45 seconds or more.

Additionally, if we try selecting a different workflow while this workflow is loading, it breaks the GUI for ALL workflows for that user until that initial request finishes. This is repeatable.

Finally, this problem occurs regardless of whether we are going through the web proxy or are local to the network. And it is not browser specific.

By loading the workflow, do you mean clicking on it in the side panel to get the detailed workflow view, not loading its id and summary state into the side panel at UI start-up?

If so, what’s different about that particular workflow? Is it much bigger than the others?

Also, after the slow initial load, does the UI perform OK on that workflow?

Yes, clicking on the side panel to load the workflow into the main section.

This workflow isn’t substantially different from the others, but it is the largest. However, other large workflows used by other teams experience the same issue.

Once it loads, yes it seems reasonably responsive.

This is not an intermittent or “only on startup” issue. We have multiple workflows and this one in particular always takes a significant time to load

Hmm, this sounds like a different issue from what we thought the OP was about. Do you see the red “you are offline” banner while it is loading? In fact, what you are describing sounds more like this (assuming you do see the red banner appear): Disconnects in Cylc UI logs

if we try selecting a different workflow while this workflow is loading, it breaks the GUI for ALL workflows for that user until that initial request finishes

What do you mean by “breaks the GUI”?

It would be helpful to provide some screen capture GIFs of what is happening, if you are able (or just screenshots).

No, we do not get the red banner while loading. It’s just a white background. Picture everything looking normal, except that where the actual tree would be there’s nothing but empty white space (what you would normally see while it is loading).

When I say “breaks the GUI” I mean that it makes every workflow behave in a similar manner for every other person accessing that owner’s UI Server. So let’s say I have four workflows running - A through D, where A through C all load within five seconds but D takes a minute. If I click on D and then try clicking on A, B, or C before D has finished loading, none of the other workflows will load normally for anybody (same result, just a white background in the main section). This appears to worsen with multiple people accessing it at the same time and isn’t as consistent in its behavior as the long load times.

OK, that’s helpful. It suggests that the data subscription to that UI Server is just taking a while to yield the first data burst. Maybe there are dropped connections and/or reconnection timeouts involved.

If it’s true that multiple users looking at the same UI Server see the same problem at the same time, that would seem to implicate either the UI Server itself, or perhaps the hub’s web proxy.

When loading a new workflow the first data burst is the biggest. That’s the full workflow state, after which incremental changes are fed to the UI. However, in Cylc 8 even the full state is typically far smaller than it would have been for the same workflow in Cylc 7, because of Cylc 8’s focus on only the active tasks and (by default) tasks one graph-edge around them. So it seems unlikely that the sheer volume of data would cause the problem.

Thanks for the clarifications.

On Load Connection Issues

The only connection issues that we have seen ourselves happen when the GUI is first loaded in the browser. It takes a while for the GUI to connect to the cylc-uiserver, and during this time the red “connection lost” banner appears. Once the connection is successfully opened, it stays open with no further issues.

Workflow Load Connection Issues

You’re describing issues which occur after the GUI has successfully connected to the server, which is something we have not seen ourselves and which requires further investigation.

I think the symptoms described could have the following causes:

  • Connection/bandwidth issues between the browser and the server.
  • Connection/bandwidth issues between the server and the scheduler.
  • Overburdened browser running slow.
  • Overburdened server running slow.
  • Overburdened scheduler running slow.
  • Code performance issue causing slow reply from the scheduler.
  • Code performance issue causing slow reply from the server.
  • Communication protocol breakdown causing received data to be rejected.

Terms:

  • GUI/Web App - the web page for monitoring/controlling workflows (cylc-ui)
  • Server/GUI Server - the bit that the web app connects to; it collates data from a user’s workflows (cylc-uiserver)
  • Scheduler - the bit that runs a workflow (cylc-flow)

Note: schedulers running older versions of cylc-flow may exacerbate performance issues.

It would be good to rule out system load on the machines being used to run the servers, schedulers and browsers, as a host with high CPU usage will naturally struggle to process data in a timely fashion.

It would also be useful to know whether the GUI servers are being run on the same machines as the schedulers.

It would also be helpful to know if any errors or warnings are associated with the issues:

  • For the web app, you can find these in the developer tools in the browser. Navigate to the “Console” tab; any errors/warnings should be highlighted in red/orange.
  • For the server, you can find the logs in the terminal where you started it, or in ~/.cylc/uiserver/log.
  • For the scheduler, you can find the logs in ~/cylc-run/<workflow-id>/log/scheduler/log.

The Connection

There are two connections involved in viewing a workflow in the GUI: the first is between the app running in the browser and the server; the second is between the server and the scheduler that is running the workflow.

Web App (cylc-ui) —> Server (cylc-uiserver) —> Scheduler (cylc-flow)

The first thing we need to work out is which of the connections is causing the issue. Unfortunately, there isn’t an especially easy way to find that out at present. If we work out where to stick debugging info into the code, would you be able to test the code straight from GitHub?

The Workflow

The workflow you posted above involves a parametrisation:

<member,fcsthr>

Could you give us an idea of the lengths of the member and fcsthr lists? This will give us an idea of the number of tasks involved.

Thanks for the detailed info. Since we do not control the system involved, we’ll have to work with our partners to gather some of this info. Here’s what I can answer:

Yes, the cylc-uiserver is running on the same system as the scheduler.

When I click on the workflows in the web GUI, I do not see an appreciable change while running mpstat on the server running the uiserver/scheduler.

There are no errors in the most recent log file in ~/.cylc/uiserver/log or in ~/cylc-run/<workflow-id>/log/scheduler.

The workflow I referenced has 10 members and roughly 100 forecast hours.

Good news, using those numbers, I think I have replicated the problem.

Here’s my adapted version of the workflow in the original post:

#!Jinja2

{% set members = 10 %}
{% set hours = 100 %}

[scheduler]
    allow implicit tasks = True

[task parameters]
    member = 0..{{members}}
    fcsthr = 0..{{hours}}
  [[templates]]
    member = member%(member)03d
    fcsthr = _fcsthr%(fcsthr)03d

[scheduling]
  initial cycle point = 2000
  runahead limit = P3
  [[xtriggers]]
    start = wall_clock(offset=PT7H15M)
  [[graph]]
    T00,T06,T12,T18 = """
        @start & prune[-PT6H]:finish => prune & purge
        @start => sniffer:ready_<member,fcsthr> => <member,fcsthr>_process? => finish
        <member,fcsthr>_process:fail? => fault
      """

[runtime]
    [[sniffer]]
        [[[outputs]]]
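{# register one custom output per member/forecast-hour combination #}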
{% for member in range(0, members + 1) %}
    {% for hour in range(0, hours + 1) %}
            ready_member{{ member | pad(3, 0) }}_fcsthr{{ hour | pad(3, 0) }} = {{ member }}{{ hour }}
    {% endfor %}
{% endfor %}

When I try to navigate to it in the GUI, the tree view remains blank for a while, the page freezes and Firefox displays the “scripts on this page are causing your browser to run slowly” message. I didn’t suffer any disconnects though. Does this match what you are seeing?

The Cause

I think the cause of this has nothing to do with the connection, but is simply the load that displaying this workflow puts on the web browser, and possibly the server.

~10 members × ~100 forecast hours results in a matrix of ~1000 tasks. There are two of these matrices (one for the sniffer task outputs and one for the *_process tasks) and three cycles (P3 runahead limit), so potentially ~6000 tasks/outputs for the GUI to process and display. When you start pushing into the high thousands of tasks, the web browser is going to start struggling.

The GUI is powered by a stream of continuous updates sent by the server. Each update has to be applied in turn, and the corresponding graphical elements are then generated from this data. If opening one workflow causes issues with other workflows, this suggests that the updates for the larger workflow may be flooding the server.

Workarounds

Host

Running this workflow causes a fair amount of load on my box. With the schedulers and servers both running on the same host, it might be feeling the strain, compounding the problem. So it’s definitely worth checking the CPU pressure on this host; if it’s high, consider running the server(s) on a different host to give them a fighting chance.

View

There are two things that will cause the browser to run slowly for this example:

  1. The amount of data it has to process.
  2. The number of items it has to display.

We can’t do anything about the first issue at present; however, we can reduce the number of tasks being displayed by switching from the “Tree” view to the “Table” view, as the “Table” view paginates results. You can set it as the default from the settings page.

Workflow

The most effective solution is to reduce the matrix size, e.g. by grouping together tasks from multiple members or forecast hours. I appreciate this introduces artificial dependencies which you might rather avoid, as I’m guessing this is a post-processing workflow that feeds off data coming from somewhere else?
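
For a rough sense of the scale involved (purely illustrative numbers, assuming forecast hours were handled in batches of, say, six per task rather than one per task):

# hypothetical illustration: batching forecast hours shrinks the task matrix
import math

members = 11      # member = 0..10
fcst_hours = 101  # fcsthr = 0..100
batch = 6         # assumed number of forecast hours handled per task

per_matrix_now = members * fcst_hours                         # 1111 tasks
per_matrix_batched = members * math.ceil(fcst_hours / batch)  # 187 tasks
print(per_matrix_now, per_matrix_batched)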

Fixes

There may be things we can do to reduce the impact of this workflow on the server / GUI.

Now that we have an example to profile against, we can see which parts of the system it’s stressing and what optimisations we can perform.

First, thank you so much for your attentiveness to our issue. In case it’s unclear, this is a UM-based ensemble workflow, so post-processing jobs being fed by the forecast output. Bundling either members or forecast hours isn’t really viable, though we had intended to split the members across multiple workflows.

Your description is accurate, so it seems like you have reproduced what we are seeing, though we’re using Chrome, which sometimes generates the “wait” dialogue and sometimes does not. The server running the uiserver/workflow is part of our HPC environment, and I would be surprised if it is at all saturated, since it is dedicated to this purpose.

For what it’s worth, this same workflow (suite) runs fine in Cylc 7, but I imagine that is to be expected since the underlying infrastructure has changed. Just out of curiosity, has an on-system GUI front-end been considered at all? (e.g. Tkinter)

Any potential optimizations would be most welcome. Please let me know if there’s any additional information I can provide to that end.