I’m having issues with rose bunch.
The job contains 7 members, each running on only a few cores, so it fits on one node easily. It’s a UM reconfiguration task, using a wrapper script to call um-recon, which in turn calls `mpirun um-recon.exe`. Pretty standard for an NWP suite like the UKV (which this is loosely based on).
I’m running at NCI, which is not a Cray (so no MOM node). This has a bearing on where the process that starts up the parallel subprocesses actually lives. For example, I’m pretty sure I can’t run rose bunch across multiple nodes.
I was initially using Open MPI to compile the UM, in which case my mpirun options were `-n 4 --bind-to none`. The binding part made the processes run on different cores; otherwise they would all run on the same cores and not save any time.
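For reference, the working Open MPI launch (simplified; the real wrapper has more plumbing) was roughly:

```shell
# Open MPI wrapper excerpt (simplified). Without --bind-to none,
# every member's 4 ranks get bound to the same first cores of the
# node; with it, the OS scheduler can spread concurrent members out.
MPIRUN_OPTS="-n 4 --bind-to none"
# mpirun ${MPIRUN_OPTS} ./um-recon.exe
echo "${MPIRUN_OPTS}"
```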
However, I’m trying to switch to Intel MPI for various reasons (better local support; my UM forecast can’t use the IO server with Open MPI and I don’t know why; …). But with Intel MPI the jobs all run on the same cores. I switched to running single-core processes, and it’s pretty clear that if I run 7 members they all land on the same core and take 7 times as long as a single member run by itself.
And if I try to run all the jobs at once in rose bunch, particularly with 4 cores each, I mostly get a segfault at the stage of reading the striping on a file. No idea whether this is an effect of everything running on the same cores, but there’s a bit of a correlation.
I tried some Intel environment variables controlling pinning (see “Environment Variables for Process Pinning” in the Intel MPI docs). So far these either had no effect, or resulted in an FPE in buffin_64 during a gc_rbcast very early in the run… So, no idea what that means, but it wasn’t successful.
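For concreteness, the sort of thing I tried looked like the below (the variable names come from the Intel MPI docs; whether these are the right values for concurrent jobs is exactly what I’m unsure about):

```shell
# Candidate Intel MPI pinning settings (names from Intel's
# "Environment Variables for Process Pinning" docs; values are
# my guesses, not a known-good recipe):
export I_MPI_PIN=off   # let the OS scheduler spread the ranks
# export I_MPI_PIN_PROCESSOR_LIST=allcores   # or pin explicitly
# then launch from the wrapper as before:
# mpirun -n 4 ./um-recon.exe
echo "I_MPI_PIN=${I_MPI_PIN}"
```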
Anyway, rose bunch uses Python to start the processes (on separate cores, in theory, right?), so maybe the Intel environment variables can’t affect this. Even though the Open MPI mpirun options did have some effect, I haven’t seen similar results with Intel MPI. So how can I ensure that rose (i.e. Python) correctly sends parallel jobs to different cores? (I’m sure this used to work on NCI’s old machine, but not on the new one…)
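One workaround I’m considering (a sketch only: it assumes rose bunch exports ROSE_BUNCH_INDEX to each command instance, and that taskset is available on the compute nodes) is to pin each member explicitly in the wrapper, so the members can’t collide regardless of what the MPI library’s own pinning does:

```shell
# Hypothetical per-member pinning in the wrapper script.
# ROSE_BUNCH_INDEX is set by rose bunch for each instance;
# default to 0 here so the snippet runs standalone.
ROSE_BUNCH_INDEX=${ROSE_BUNCH_INDEX:-0}
NCORES=4
FIRST=$(( ROSE_BUNCH_INDEX * NCORES ))
LAST=$(( FIRST + NCORES - 1 ))
# taskset restricts the whole mpirun process tree to this core
# range, so member 0 gets cores 0-3, member 1 gets 4-7, etc.:
# taskset -c ${FIRST}-${LAST} mpirun -n ${NCORES} ./um-recon.exe
echo "member ${ROSE_BUNCH_INDEX}: cores ${FIRST}-${LAST}"
```

No idea if this would fight with Intel MPI’s pinning or play nicely with it, which is partly why I’m asking.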
Any thoughts appreciated.