Good news: using those numbers, I think I have replicated the problem.
Here’s my adapted version of the workflow in the original post:
#!Jinja2
{% set members = 10 %}
{% set hours = 100 %}
[scheduler]
    allow implicit tasks = True
[task parameters]
    member = 0..{{members}}
    fcsthr = 0..{{hours}}
    [[templates]]
        member = member%(member)03d
        fcsthr = _fcsthr%(fcsthr)03d
[scheduling]
    initial cycle point = 2000
    runahead limit = P3
    [[xtriggers]]
        start = wall_clock(offset=PT7H15M)
    [[graph]]
        T00,T06,T12,T18 = """
            @start & prune[-PT6H]:finish => prune & purge
            @start => sniffer:ready_<member,fcsthr> => <member,fcsthr>_process? => finish
            <member,fcsthr>_process:fail? => fault
        """
[runtime]
    [[sniffer]]
        [[[outputs]]]
{% for member in range(0, members + 1) %}
{% for hour in range(0, hours + 1) %}
            ready_member{{ member | pad(3, 0) }}_fcsthr{{ hour | pad(3, 0) }} = {{ member }}{{ hour }}
{% endfor %}
{% endfor %}
When I try to navigate to it in the GUI, the tree view remains blank for a while, then the page freezes and Firefox displays the “scripts on this page are causing your browser to run slowly” message. I didn’t suffer any disconnects, though. Does this match what you are seeing?
The Cause
I think the cause has nothing to do with the connection; it is simply the load that displaying this workflow puts on the web browser, and possibly the server.
~10 members × ~100 forecast hours results in a ~1000-task matrix. There are two of these matrices (one for the sniffer task outputs and one for the *_process tasks) and three cycles (P3 runahead limit), so potentially ~6000 tasks/outputs for the GUI to process and display. Once you start pushing into the high thousands of tasks, the web browser is going to start struggling.
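As a rough check of the estimate above, here is the arithmetic as a small Python sketch. The function name `estimate_items` is my own, and counting three active cycles for `P3` follows the reasoning above rather than anything I have measured. Note that the inclusive ranges (`0..10`, `0..100`) actually give 11 members and 101 hours, so the total comes out a little above 6000.

```python
def estimate_items(members: int, hours: int, cycles: int = 3, matrices: int = 2) -> int:
    """Rough upper bound on the tasks/outputs the GUI may have to handle.

    `members` and `hours` are the upper bounds of the workflow's
    inclusive parameter ranges (0..members and 0..hours).
    """
    n_members = members + 1  # 0..members is inclusive
    n_hours = hours + 1      # 0..hours is inclusive
    per_matrix = n_members * n_hours
    # Assumption: one matrix of sniffer outputs plus one of *_process tasks,
    # repeated for each cycle allowed by the runahead limit.
    return per_matrix * matrices * cycles


print(estimate_items(10, 100))  # 6666
```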
The GUI is powered by a stream of continuous updates sent by the server. Each update has to be applied in turn, and the corresponding graphical elements are then generated from this data. If opening one workflow causes issues with other workflows, this suggests that the updates for the larger workflow may be flooding the server.
Workarounds
Host
Running this workflow causes a fair amount of load on my box. With the schedulers and servers both running on the same host, that host might be feeling the strain, compounding the problem. So it’s definitely worth checking the CPU pressure on this host. If it’s high, consider running the server(s) on a different host to give them a fighting chance.
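One quick way to check this is to compare the load average with the number of CPUs. The sketch below uses only the Python standard library (`os.getloadavg`, Unix-like systems only). The helper names and the threshold of 1.0 per CPU are my own assumptions, a common rule of thumb rather than anything Cylc-specific.

```python
import os


def load_per_cpu(load_1min: float, cpu_count: int) -> float:
    """Normalise the 1-minute load average by the number of CPUs."""
    return load_1min / max(cpu_count, 1)


def looks_busy(load_1min: float, cpu_count: int, threshold: float = 1.0) -> bool:
    """True when the normalised load exceeds the (assumed) threshold."""
    return load_per_cpu(load_1min, cpu_count) > threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"load per CPU: {load_per_cpu(load_1min, cpus):.2f}")
    print("busy" if looks_busy(load_1min, cpus) else "ok")
```

Run it on the host where the schedulers and servers live, ideally while the large workflow is open in the GUI.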
View
There are two things that will cause the browser to run slowly for this example:
- The amount of data it has to process.
- The number of items it has to display.
We can’t do anything about the first issue at present. However, we can reduce the number of tasks being displayed by switching from the “Tree” view to the “Table” view, as the “Table” view paginates results. You can set it as the default from the settings page.
Workflow
The most effective solution is to reduce the matrix size, e.g. by grouping together tasks from multiple members or forecast hours. I appreciate that this introduces artificial dependencies which you might rather avoid, as I’m guessing this is a post-processing workflow that feeds off data coming from somewhere else?
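To give a feel for how much grouping could help, here is a small Python sketch that chunks the forecast hours and counts the resulting tasks. The function name `group_hours` and the group size of 10 are purely illustrative assumptions; each grouped task would then have to process every hour in its chunk.

```python
def group_hours(max_hour: int, group_size: int) -> list[range]:
    """Split the inclusive range 0..max_hour into consecutive chunks."""
    return [
        range(start, min(start + group_size, max_hour + 1))
        for start in range(0, max_hour + 1, group_size)
    ]


groups = group_hours(100, 10)  # group size of 10 is an arbitrary choice
members = 11                   # 0..10 inclusive, as in the workflow above

print(len(groups))             # chunks per member: 11
print(members * len(groups))   # tasks per matrix after grouping: 121
```

With these numbers each matrix drops from 1111 entries to 121, at the cost of every task waiting for all the hours in its chunk.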
Fixes
There may be things we can do to reduce the impact of this workflow on the server / GUI.
Now we have an example to profile against, we can see what parts of the system it’s stressing and what optimisations we can perform.