Dynamic load balancing

This is potentially related or adjacent to my discussion on reloading the global.cylc file.

I love that workflows can migrate themselves to clear a condemned host, and that workflows can pick which host to start on based on my criteria. This functionality is great, but it appears to have some flaws that we’re tripping over and maybe there are remediations that I don’t know about.

  1. Some of our models have many workflows. For example, our city ensemble model has 35 separate workflows, all of which get launched at essentially the same time. This means that all 35 workflows tend to do the same calculation of which host is “least busy” at the same time, and the majority then tend to pick the same host, which is most certainly not “least busy” by the time 20-odd of them have started running. I realise that we can add a “sleep” to delay the calculation of “least busy”, but that doesn’t resolve the related issue where a number of colleagues all come back from lunch (or whenever), start their workflows at about the same time, and hit the same problem.

  2. When we need to do work on our cluster (such as VM upgrades) we condemn each host in turn (encouraging workflows to migrate themselves to other hosts), reboot the host, and so on. However, at the end of the work we end up with n-1 hosts running all of the workflows and the final host running none. Given our clusters currently only have 2-3 hosts, this is a problem. (We will increase the number of hosts in our clusters, but that doesn’t entirely solve the issue.)

We are looking for options to better rebalance these clusters. We have considered:

  • Placing a file containing the condemned hosts in specific users’ ~/.cylc/flow/ directories
  • Exploring setting/broadcasting other CYLC_CONF_PATH values that allow us to set condemned hosts for “groups” of workflows
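For the second option, a minimal sketch of the idea (the directory layout and hostnames here are hypothetical, not something we run yet): keep a group-specific global.cylc in a shared location, point that group’s workflows at it via CYLC_CONF_PATH, and manage the condemned hosts list for the group there.

# e.g. /shared/cylc-conf/city-ensemble/global.cylc, selected by exporting
# CYLC_CONF_PATH=/shared/cylc-conf/city-ensemble before starting the workflows
[scheduler]
    [[run hosts]]
        # hosts this group of workflows may start (or restart) on
        available = host-a, host-b, host-c
        # hosts this group should migrate off
        condemned = host-b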

It would be really interesting to know how others are handling the same issue.

Hi @jarich

Good question.

We really only have crude load balancing at the moment, by selecting hosts at workflow start-up. This will tend to even out the load over time, as workflows get shut down and restarted, but (as you note) it’s not ideal when a whole lot get started, or migrated off of a condemned host, all at once.

Workflow self-migration could be used for more dynamic balancing in principle, but at the moment it’s only triggered by the condemned host setting.

I’m not sure we’ve discussed this on the team; I’ll see …

I note that this generally won’t tend to even out the load over time for us in production. Our production workflows are expected to run for months, if not years, with only the occasional reload when something in the definition changes (hence the need for a global config reload). While I’m sure we’ll be starting new workflows more regularly as we spin up production, that is not our long-term trend.

In Cylc 7, we assigned different workflows to specific VMs to manage load. We had to shut down all workflows for the patching and rebooting that happens multiple times per year, which was disruptive and annoying. I was excited about the Cylc 8 auto-migration functionality, but without some way to rebalance at the VM level this may be a less effective option for us than it initially appeared. Most of the time we’re going to be running workflows on one fewer host than we have available in our cluster, because most of the time we won’t have started (or stopped and restarted) any workflows between rounds of patching and rebooting.

Our auto-migration experiences have not yet been super smooth, but we’re hoping that the issues we are encountering are our fault, which is why we haven’t raised any new bugs for it yet.

On our system we have enough turnover that the load balances out over time. The hosts are big enough to cope with any short term load peaks.

This doesn’t cause issues in our setup, but we don’t have 35-strong workflow groups.

There are two sources of variation in restart time which you can configure if this is causing a problem:

  1. The [scheduler][main loop][auto restart]interval (the default is 10 mins, I think). Setting a longer interval helps to spread the migration over a longer period (but doesn’t help with large numbers of workflows started at the same time).
  2. The [scheduler]auto restart delay. This adds a random sleep to the auto restart time, which will help with your 35-workflows-started-at-the-same-time problem. It is turned off by default and we don’t actually use it, but it might help in your setup (see the sketch after this list).
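For example, both settings live in global.cylc. This is a sketch only; the values are illustrative, not recommendations:

[scheduler]
    # random stagger added before each auto stop-restart, so simultaneous
    # migrations spread out (assumed here to take an ISO 8601 duration,
    # like other Cylc 8 intervals)
    auto restart delay = PT1M
    [[main loop]]
        # the auto restart plugin must be listed here to be enabled
        plugins = health check, auto restart
        [[[auto restart]]]
            # how often each scheduler checks the global config for condemned hosts
            interval = PT10M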

we end up with n-1 hosts running all of the workflows

We have quite a dynamic research workload with workflows being started all the time, so the empty host quickly fills up.

Your system needs to be able to function with n-1 servers in order to allow servers to be pulled out of the pool via this mechanism, so this amount of slack is necessary.

This can be avoided in cloud environments, where we have the ability to spin servers up and down at will. Note that the auto-restart plugin inspects the global configuration at the configured interval, so it should be compatible with dynamic servers. In the cloud world we can just spin up a new server when needed rather than recycling old ones, which avoids the n-1 problem.
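In config terms, draining an old server and bringing a new one into the pool is just an edit to the shared global.cylc, which each scheduler acts on at its next auto restart check (hostnames hypothetical):

[scheduler]
    [[run hosts]]
        # newly provisioned server added to the pool
        available = server-2, server-3, server-new
        # schedulers still on the retiring server migrate away at their next check
        condemned = server-1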

I had been looking at reducing [scheduler][main loop][auto restart]interval to 1 minute so we can failover faster. When we want to drain a host for maintenance, it’d be nice if that didn’t take too long.

Do you see issues with setting the auto restart interval to 1 minute and enabling the delay? If we keep using the 15-minute load average, that will tend to send all workflows to the same host. I was thinking that changing the host selection to -1 * cpu_times().idle might give us something we can drain quickly and that spreads load around a bit more, as the selection criterion is a little more responsive.
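For concreteness, the ranking I have in mind would be something like this (untested, just a sketch):

[scheduler]
    [[run hosts]]
        ranking = """
            # prefer the host with the most accumulated idle CPU time
            -1 * cpu_times().idle
        """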

Thoughts?

Hi @psa

Disclaimer: we haven’t been using these settings at my site yet (we will do on the new HPC, coming soon), so the following may be uninformed and the UK team may be better placed to advise…

I think a shorter interval, plus a delay, should be OK. Load averaging time, for host ranking, will presumably have an impact, but it probably depends quite strongly on what your workflows look like (many small ones, a few big ones, something in between…).

I’m inclined to suggest you test it and see how it works.

Note that a bit of “manual” load re-balancing is perfectly feasible if needed: just manually stop (cylc stop --now) and restart some workflows. That’s a safe and quick operation, even if long-running tasks are active at the time (and even if tasks succeed or fail between stop and restart).

Sounds fine to me. The only negative impact of this is going to be slightly increased filesystem load.

Yes, using a 15-minute average is going to cause problems if you want all of your workflows to restart within this window. There are also the 5-minute average and instantaneous CPU usage, which might be more suitable.

We used to use load average for load balancing with Cylc 6. We switched to use memory sometime after moving to Cylc 7 which we found resulted in more even distribution. Here’s the load balancing recipe we use:

# reject servers with high load
getloadavg()[2] < 20

# reject servers with low memory
virtual_memory().available > 2000000000

# pick the server with the most available memory
-1 * virtual_memory().available
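For reference, that recipe slots into global.cylc as the host ranking, something like this (hostnames hypothetical):

[scheduler]
    [[run hosts]]
        available = server-1, server-2, server-3
        ranking = """
            # reject servers with high load
            getloadavg()[2] < 20
            # reject servers with low memory
            virtual_memory().available > 2000000000
            # pick the server with the most available memory
            -1 * virtual_memory().available
        """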

Does stop --now and play again work fine if a workflow has a long-running local background task? Will it all work nicely (polling, etc)?

Will it all work nicely (polling, etc)?

No.

However, the auto-restart mechanism will wait for any local background jobs to finish before restarting to avoid this problem.

I thought I would share with you a screenshot of our Cylc servers going through automated migration so you can see how this process looks in our setup:

  • This is a 16 server research setup normally running ~1’000 workflows (peak usage ~1’200).
  • There’s a fair amount of headroom in this setup at present. Load was a little lower than normal in the time window, but we could get away with fewer servers.
  • We condemn one host at a time using an automated system. You can see the free memory go up as the server drains and drop again after it is returned into the pool.
  • Sometimes the server takes longer to migrate (free memory flatlines for a few minutes); this is mostly due to workflows waiting for local background jobs to clear out before migrating.
  • We use memory as the primary metric for load balancing. Most of the workflows restart on the newly restored server, but if you look closely you can see the server with the most available memory changes around a few times.
  • We use the default [scheduler][main loop][auto restart]interval = PT10M, we don’t specify a [scheduler]auto restart delay.
  • The last server was returned into service after office hours so didn’t return to normal loading until the following morning.
  • CPU load was low and steady throughout (these VMs have a higher CPU to memory ratio than required anyway).
  • (Server 7 had a memory blip unrelated to the migration).
