Slow load of Cylc workflows, disconnects

Not really. A web UI was the almost-universal preference, for a whole bunch of reasons, and developing more than one full-blown GUI for a system like Cylc would be a BIG job (noting that TUI will actually fit that bill once sufficiently performant and enhanced a bit).

I’ve done a bit of testing myself with the example as posted by @oliver.sanders above (tweaked a bit, see below).

You’ve probably seen comments that the Cylc 8 scheduler is much more efficient than Cylc 7 because it only needs to manage “active” tasks, and the scheduler does indeed handle this example fine. However, it is still a beast of a workflow from the UI perspective. We’ve designed the UI to display a window of n graph edges (default n=1) around active(*) tasks, which is usually very efficient in terms of not displaying tasks that are far removed from the current action; but in this case, with thousands of outputs leading out from the active tasks, the default n=1 window still contains a huge number of tasks.
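As a schematic illustration of the window (not your workflow): in the linear chain below, if c is the only active task then the default n=1 window contains b, c, and d, while n=0 would show only c.

a => b => c => d => e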

My suggestions for the moment:

  • consider changing the workflow structure artificially with some dummy tasks, to reduce the n=1 window size (see the schematic after this list)
  • change the default view (UI preferences page) to the paginated table view. The amount of data in the browser will be the same, but the job of displaying it will be easier
  • also consider filtering by task state, to display only the active tasks
  • set the n-window extent to n=0 (UI workflow drop-down menu item). The UI will only get data about the active tasks. Both the data volume and display work will be massively reduced. The UI will no longer show what tasks come next until they enter the active window, but it will still fulfill the primary job of a live monitoring and control system. Note that changing the n-window extent does not have an instant effect. It will change when the next data update gets pushed from the UI Server.

[(*) “active” means literally active, plus tasks that are ready to run according to the task graph but held back for other reasons such as a Cylc queue, a task hold, or an xtrigger]
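To illustrate the first suggestion (a schematic only; watcher, gate, and BIG_FAMILY are made-up names, not your tasks): while a long-running task is active, everything one graph edge downstream of it sits in the default n=1 window, so an artificial intermediate task can push a big fan-out two edges away and out of that window.

# before: every member of BIG_FAMILY is 1 edge from the long-running watcher
watcher:ready => BIG_FAMILY

# after: BIG_FAMILY is 2 edges away, outside the default n=1 window
watcher:ready => gate => BIG_FAMILY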

@puskar49 - I took a closer look at your workflow. I think it can easily be restructured to make it far more efficient.

sniffer:ready_<member,fcsthr> => <member,fcsthr>_process?
# i.e.:
sniffer => ALL_PROCESSING_TASKS

I’m guessing (it all hinges on this) that sniffer watches the members of an ensemble of forecast models that each generate a sequence of output files as they run. Whenever it sees a new file it emits an output message to trigger the processing task for the new file.

The problem is the graph structure does not tell Cylc that these files (for each member) are generated in sequence rather than all at once. So the UI sees ALL of those tasks - thousands of them - at once in the n=1 window.

Instead, you can have a separate lightweight sniffer task for every (member, hour), and for each member make them run in sequence. For a given member, there’s no need to start sniffing for file 10 before file 9 is found, because the model generates the files in sequence.

start => sniffer_<mem, hour=0>
sniffer_<mem, hour-1> => sniffer_<mem, hour>
sniffer_<mem, hour> => process<mem, hour>

Then you can simplify the sniffer outputs, because outputs are tied to the task instance.

Here’s my working test case, for which the UI loads instantly and remains 100% responsive:

#!Jinja2
{% set members = 10 %}
{% set hours = 100 %}
[task parameters]
    member = 0..{{members}}
    fcsthr = 0..{{hours}}
    [[templates]]
        member = member%(member)03d
        fcsthr = _fcsthr%(fcsthr)03d
[scheduling]
    initial cycle point = 2000
    runahead limit = P3
    [[queues]]
        [[[default]]]
            limit = 10
    [[xtriggers]]
        start = wall_clock(offset=PT7H15M)
    [[graph]]
        T00,T06,T12,T18 = """
            @start & prune[-PT6H]:finish => prune & purge
            @start => sniffer_<member,fcsthr=0>
            sniffer_<member,fcsthr-1> => sniffer_<member,fcsthr>
            sniffer_<member,fcsthr>:ready => <member,fcsthr>_process? => finish
            <member,fcsthr>_process:fail? => fault
        """
[runtime]
    [[root]]
        pre-script = "sleep 10"
    [[sniffer_<member,fcsthr>]]
        script = "cylc message 'file ready'"
        [[[outputs]]]
            ready = "file ready"
    [[prune, purge, fault, finish]]

Note I’ve used a queue with limit=10 to prevent my laptop from being overloaded with task jobs (although that is much less of a problem with sequential sniffer tasks). But that is pretty much irrelevant to the issue - it is not the number of running tasks that is the problem.

Hope that helps?

Thank you for your suggestions, but I have some concerns with this approach.

First, correct me if I’m wrong, but wouldn’t this solution require that the forecast hours be sequential or at least have even spacing? I’m looking at sniffer_<member,fcsthr-1> and assume that relies on subtracting one from the value of fcsthr, not the index of the parameter list? If so, that would not work for us since our forecast hour list is not evenly spaced.

Additionally, wouldn’t this double the number of overall tasks? Instead of one sniffer job (that is already quite lightweight) we’d have 1000 sniffer tasks leading to 1000 post processing jobs. Job submission failures are always a concern, and I fear this approach would lead to delays because any one of the 1000 sniffer jobs could fail and all subsequent forecast hours would be delayed waiting for a retry. The current sniffer will send messages for all fields files that it finds in any given retry (and indeed usually starts multiple forecast hours at once).

Another problem with this approach is our inability to track delays. We use a sniffer for both the start files we receive and the fields files generated by UM. In both cases, we track if there are delays in receipt of the first file as well as delays between files. That would be problematic in a situation where each sniffer is independent.

The graph we provided is only for the post processing piece of our workflow. So for the full workflow, we would have an additional 260 tasks for the start file sniffer, and that’s assuming a 10 member ensemble (the long-term goal is 18 members). And the start file sniffer doesn’t use numbers and the files do not arrive sequentially.

I understand Cylc 8 has many improvements and I respect the work that has gone into it. And this information might not necessarily be relevant to this conversation, but I would like to state again that we ran this ensemble suite, with both model and post processing, with 18 members in Cylc 7 without any Cylc problems. We’re not opposed to adjusting our approach (and indeed we are already in the process of splitting up our members into multiple workflows) but if there is any possibility of improving the efficiency of the Cylc 8 GUI we’d definitely be interested in that.

I quite like the approach you’ve taken (and use it myself); however, ~1000 tasks is clearly pushing scaling too far at the moment.

Hillary suggested an odd workaround in a previous response which may be worth investigating.

set the n-window extent to n=0 (UI workflow drop-down menu item)

This would essentially filter out many of the waiting and succeeded tasks, reducing the amount of data the GUI has to process. You can action this in the GUI by clicking on the workflow icon, then selecting the “set graph window extent” command and entering the value 0.

If this helps, you can automate it by having the following script run at the start of the workflow (e.g. in a startup handler or task):

# "read -d ''" returns non-zero at the end of the heredoc, so temporarily
# disable errexit around it.
set +e

# Set nEdgeDistance below to the window extent you want (0 in this case).
read -r -d '' gqlDoc <<_DOC_
{"request_string": "
mutation {
  setGraphWindowExtent (
    workflows: [\"${CYLC_WORKFLOW_ID}\"],
    nEdgeDistance: 0) {
    result
  }
}",
"variables": null}
_DOC_

echo "${gqlDoc}"

# Note: ${gqlDoc} is deliberately left unquoted here so that the embedded
# newlines collapse to spaces, giving a valid single-line JSON document.
cylc client "$CYLC_WORKFLOW_ID" graphql < <(echo ${gqlDoc}) 2>/dev/null

set -e
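One way to wire that in (a minimal sketch, assuming the script above is saved as bin/set-window-extent in the workflow source directory; the task name here is made up) is a one-off task at the start of the graph:

[scheduling]
    [[graph]]
        R1 = "set_window_extent"
[runtime]
    [[set_window_extent]]
        # the workflow's bin/ directory is on PATH for task jobs
        script = set-window-extent
        # run locally on the scheduler host
        platform = localhost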


Parameter offsets are index-based rather than value-based, so this doesn’t impose an even-spacing restriction, only an ordering one, e.g.:

[task parameters]
    foo = a, b, c

graph = one<foo> => two<foo-1>
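Applied to the earlier question (a sketch; the forecast-hour values and the start task are made up), an unevenly spaced forecast-hour list chains up just the same, because <fcsthr-1> means “the previous entry in the list”, whatever its value:

[task parameters]
    fcsthr = 0, 3, 6, 12, 24   # uneven spacing is fine
[scheduling]
    [[graph]]
        R1 = """
            start => sniffer_<fcsthr=0>
            # each sniffer waits for the one before it in the list,
            # whatever the actual hour values are
            sniffer_<fcsthr-1> => sniffer_<fcsthr>
        """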


Additionally, wouldn’t this double the number of overall tasks?

Yes it would.

However, these tasks would only run one at a time, so the overall impact on the system would be similar.

The difference is in the per-task overheads, i.e. the job submission and the startup cost of the application.



I fear this approach would lead to delays because any one of the 1000 sniffer jobs could fail

To build in failure resistance, use optional outputs:

sniffer_<mem, hour-1>? => sniffer_<mem, hour>?


but if there is any possibility of improving the efficiency of the Cylc 8 GUI we’d definitely be interested in that.

There absolutely is and we would love to make this example run more smoothly for you. We have already identified some optimisations that would improve the situation.

For example, this proposed change should stop the issue from affecting other workflows, as you reported. I believe there is also reasonable scope for optimising the JavaScript code that is currently causing the browser to freeze for your workflow.

Be aware, the core team is very small and its members have responsibilities to their own sites, which aren’t encountering this problem at present, so these investigations and the subsequent work may take a little while to complete; there is a lot of other high-priority work ongoing at the moment. So you might want to look into a change of approach as a short-term workaround whilst we work on this alongside other matters, as I don’t think there will be a single change that resolves this. If you have any resource to contribute, do let us know and we can help point it in the right direction.

Understood all the way around. Thanks again for your consideration and suggestions – I really appreciate it!

@puskar49 - @oliver.sanders has responded on most points. A couple more notes…

Yes, but it’s not the sheer number of tasks that is the problem; it’s that your current structure makes them all appear at once, at the same place in the graph, even though they don’t actually run like that.

The job submission retry delay is configurable, and it can be zero - i.e. instant retry.
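For example (a minimal sketch; the delay values and the SNIFFERS family name are arbitrary), Cylc 8 retry settings live under the task or family definition:

[runtime]
    [[SNIFFERS]]
        # resubmit immediately on a submission failure, then back off
        submission retry delays = PT0S, PT1M, PT5M
        # optionally also retry the job itself if it fails
        execution retry delays = 2*PT30S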

Also, Cylc can submit different tasks to different platforms. These sniffer tasks will be very light, and not many of them will run at once (one per ensemble member, basically) so you could run them as background jobs on an appropriate server, possibly even on the scheduler host, rather than submit them to your resource manager.
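A sketch of that (SNIFFERS, PROCESSING, and my_hpc_platform are placeholder names; localhost is the built-in platform that runs jobs in the background on the scheduler host):

[runtime]
    [[SNIFFERS]]
        # lightweight polling jobs: background jobs on the scheduler host
        platform = localhost
    [[sniffer_<member,fcsthr>]]
        inherit = SNIFFERS
    [[PROCESSING]]
        # heavy jobs still go to the resource manager as usual
        platform = my_hpc_platform
    [[<member,fcsthr>_process]]
        inherit = PROCESSING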

OK, interesting. There are probably other good ways to do that though. E.g. an event-handler for sniffer-task job submission or success, that updates some kind of monitor.
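For instance (a rough sketch; notify-monitor is a hypothetical script of your own), a task event handler on the sniffer family could report submission and success times to an external monitor:

[runtime]
    [[SNIFFERS]]
        [[[events]]]
            # call your script with the event name, workflow ID and task ID
            handler events = submitted, succeeded
            handlers = notify-monitor %(event)s %(workflow)s %(id)s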

Great, I was going to suggest that too. There are some advantages to a more modular system of smaller workflows. And you can have cross-workflow triggering where needed.
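For the cross-workflow triggering, the built-in workflow_state xtrigger is the usual mechanism. A rough sketch (the workflow and task names are made up, and the exact argument names vary a little between Cylc 8 releases):

[scheduling]
    initial cycle point = 2000
    [[xtriggers]]
        # wait for a task in another workflow to succeed at the same cycle point
        upstream = workflow_state(workflow='other/run1', task='model_done', point='%(point)s')
    [[graph]]
        T00 = "@upstream => process"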

Have begun profiling. As feared, there’s no single source of the issue, but there is some low-hanging fruit which should improve the situation. These two issues track the investigation at the server and the client:

Keep up to date with the latest Cylc releases (especially cylc-uiserver releases) and improvements will trickle through as we knock them off.


With a few optimisations we’ve managed to reduce the load times by a fair amount.

Loading the tree view took between 60 and 100 seconds before; it now takes 24 seconds. The faster table view now takes 15 seconds to load.

Still not ideal, but hopefully enough to make the GUI usable for more examples. The bulk of these optimisations will arrive in cylc-uiserver 1.5.0 which I expect we will release before too long.

Further optimisations are on the books for the future, but they will be more involved and so will take longer.