How to get parallelisation with rose bunch working on all cores

I’m having issues with rose bunch.
The job contains 7 members, each running on only a few cores, so it can easily fit on one node. It’s a UM reconfiguration task, using a wrapper script to call um-recon, which in turn calls mpirun um-recon.exe. Pretty standard for an NWP suite like the UKV (which this is loosely based on).

I’m running at NCI, which is not a Cray (so no mom-node). This has a bearing on where the process that starts up the parallel subprocesses actually runs. For example, I’m pretty sure I can’t run rose bunch across multiple nodes.

I was initially using Open MPI to compile the UM, in which case my mpirun options were “-n 4 --bind-to none”. The binding option was what made the processes run on different cores - otherwise they would all pile onto the same cores and not save any time.
However, I’m trying to change to Intel MPI for various reasons (better local support; my UM forecast can’t use the IOServer with Open MPI and I don’t know why; …). But with Intel MPI the jobs all run on the same cores. I switched to running single-core processes, and it’s pretty clear that if I run 7 members they all run on the same core and take 7 times as long as if I run 1 member by itself.
And if I try to run all jobs at once in rose-bunch, particularly on 4 cores each, I mostly get a segfault at the stage of reading the striping on a file. No idea if this is an effect of trying to run everything on the same cores, but there’s a bit of a correlation.
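
For reference, the two launch styles I’m comparing look roughly like this (simplified; the executable name and core counts are just what my wrapper happens to use):

# Open MPI: without --bind-to none, the members from separate mpirun calls
# all end up pinned to the same cores; disabling binding lets them spread out.
mpirun -n 4 --bind-to none ./um-recon.exe

# Intel MPI: the equivalent call, which in my case leaves every member
# competing for the same cores.
mpirun -n 4 ./um-recon.exe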

I tried some Intel environment variables controlling pinning (Environment Variables for Process Pinning). So far these have either had no effect, or resulted in an FPE while doing a buffin_64 at a gc_rbcast very early in the run… So, no idea what that means, but it wasn’t successful.
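
For the record, the kind of settings I tried looked roughly like this (the values are only examples of what I experimented with, not a working recipe):

# Illustrative Intel MPI pinning settings (from the "Environment Variables
# for Process Pinning" page); none of these fixed placement for me.
export I_MPI_DEBUG=4    # 4 or higher prints the pinning map at startup
export I_MPI_PIN=off    # disable pinning - closest analogue to --bind-to none
# or keep pinning on but pick the cores explicitly:
# export I_MPI_PIN=on
# export I_MPI_PIN_PROCESSOR_LIST=0-3
mpirun -n 4 ./um-recon.exe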

Anyway, rose bunch is using Python to start the processes (on separate cores, in theory, right?). So maybe the Intel environment variables can’t affect this. Even though the Open MPI mpirun options did have some effect, I haven’t seen similar results with Intel MPI. So how can I ensure that rose (i.e. Python) is going to correctly send parallel jobs to different cores? (I’m sure this used to work on NCI’s old machine, but not the new one…)

Any thoughts appreciated.

Hello,

Unfortunately I don’t think I’m going to be able to shed much light on what’s going on but here’s some quick info on Rose Bunch.

For example, pretty sure I can’t run rose bunch across multiple nodes.

Yes, Rose Bunch is just a simple command runner; it’s not node aware.

Anyway, rose bunch is using python to start the processes (on separate cores, in theory, right?)

Rose Bunch is a very simple wrapper for firing off sub-processes; it uses Python’s subprocess module under the hood, but the exact implementation is not important. It’s just starting subprocesses as you would on the command line or from a bash script.

I.e. the following rose_bunch app:

mode=rose_bunch
meta=rose_bunch

[bunch]
command-format=mycommand %(arg1)s
pool-size=5

[bunch-args]
arg1=1 2 3 4 5

Is effectively equivalent to the following bash script:

mycommand 1 &
mycommand 2 &
mycommand 3 &
mycommand 4 &
mycommand 5 &
wait
...

Just with some added extras for capturing stdout/err, return codes, etc.

Because of the way it launches these processes, Rose Bunch doesn’t have any influence over how they (or their threads) are distributed across the available compute resources.
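
So if you need particular placement, it has to be expressed in the command itself. Purely as an illustration (taskset is just one generic Linux way of doing it, and the core ranges are made up), the bash equivalent above would become something like:

# Pin each bunch item to its own core range via the command itself.
taskset -c 0-3  mycommand 1 &
taskset -c 4-7  mycommand 2 &
taskset -c 8-11 mycommand 3 &
wait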

Hi Oliver,

Thanks for your reply. That’s clarified my understanding of how things work, and it has given me some ideas to test what is happening in my job to see if it is behaving the way I would want.

Susan

Hi Susan,

Original rose bunch author here (weirdly enough it was written as a tool for running multiple serial um-recons when we moved to our Cray from an IBM). Your understanding’s pretty much spot on as to what’ll be happening under the hood. I don’t believe you’d be able to get it to run cross-node (you certainly can’t out of the box on XC40 compute nodes).

While it was only really ever intended for use on serial/shared-type nodes rather than compute nodes, it’s possible to force things to behave to a degree on compute nodes if you can find the right magic incantation. I’ve not run on NCI, but there may be a few places you can get rose-bunch to behave for you based on lessons learned running on our HPC - though it’s really forcing it to behave in a way it wasn’t intended to. On the compute nodes of our Cray, in order to get things to place properly from rose bunch, we often have to use aprun at the suite level, so the task command becomes script=aprun rose task-run or some variant. On the Cray that gets each bunch item to place correctly on separate cores; the main gotcha seems to be that it doesn’t re-use earlier cores, so yes, you can parallelise, but you can’t necessarily run e.g. 2 batches of 10. Your other option is to try inlining an mpi-launch type command in your command, along the lines of command-format=mpi_launch mycommand ..., though I’ve never managed to get that to work right on our Cray either.
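
As a very rough sketch of the inline-launcher idea (the wrapper name, arguments and launcher options here are purely illustrative and will be machine-specific):

#!/usr/bin/env bash
# launch_item.sh - hypothetical per-item wrapper, used as something like
#   command-format=launch_item.sh %(arg1)s
# so that each bunch item does its own MPI launch with explicit binding.
set -eu
item="$1"
mpirun -n 4 mycommand "$item"   # add whatever binding flags your MPI needs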

The other issue you may come across if things are partially working is memory contention. Again, I can’t speak to NCI’s machine, but I can cause segfaults for things that should work on our Cray by virtue of which processors are associated with which memory (each half of an XC40 compute node’s CPUs is associated with one half of the node’s memory), so you can get segfaults where you run off the end of the associated RAM if you’re not able to control memory/CPU pairings.

For debug purposes, it may be that the environment you’re running in has some sort of processor-id variable you can spit out at the start of your bunch item, which will hopefully help diagnose what’s going on - but again, I’m no Intel expert.
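
For example, something along these lines at the top of each bunch item (generic Linux commands rather than anything Intel-specific, so treat it as a starting point):

# Report where this bunch item is allowed to run, for debugging placement.
echo "host:     $(hostname)"
echo "affinity: $(taskset -cp $$)"        # allowed CPU list for this shell
grep Cpus_allowed_list /proc/self/status  # same information from /proc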

Hopefully some of these pointers are useful to you.

Best,

Andy

Hi Andy,
Thanks for your response.

I have found how to spit out processor IDs, and I ran with a few variations.
With Intel MPI, the um-recon.exe processes are all ultimately launched on different processors (no matter where the preceding calling scripts ran).
With Open MPI, they are all launched on the same processor, unless “--bind-to none” is passed to mpirun, in which case they are spread around.
Either way, they all take the same amount of time!

However, if all 7 members run simultaneously, there is clearly some serious IO contention, since the timing summary information shows that everything takes about 7 times longer than if I run each um-recon.exe in serial (pool size of 1).

Perhaps it’s because this task is relatively fast, so IO makes up a large part of it, but the upshot is that parallelising has no benefit at all! I’m sure this isn’t true of all types of tasks…

So, in conclusion, I think that I understand what rose-bunch is doing, and how to make sure it is doing the right thing. But it’s not always useful for jobs that are mostly IO. And as for running across nodes, I got creative with Jinja to auto-create multiple rose-bunch tasks, splitting my 18 JULES jobs into e.g. 6 separate tasks. It seems to do the job well enough.

Kind regards,
Susan
