More efficient data movement with MPI

Just like we did manually with memmap, you can move data more efficiently with MPI by sending it to just one engine and using MPI to broadcast it to the rest.

import socket
import os, sys, re

import numpy as np

import ipyparallel as ipp

For this demo, I will connect to a cluster with engines started with MPI.

One way to do so would be:

ipcluster start -n 64 --engines=MPI --profile mpi
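Equivalently, in IPython Parallel 7 and later the same cluster can be started from Python with the Cluster API (a sketch; the "mpi" alias selects the MPI engine launcher):

import ipyparallel as ipp

cluster = ipp.Cluster(engines="mpi", n=64, profile="mpi")
rc = cluster.start_and_connect_sync()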

In this directory is a docker-compose file to simulate multiple engine sets launched with MPI.

I ran this example with a cluster on a 64-core remote VM, so communication between the client and controller is over the public internet, while communication between the controller and engines is local.

rc = ipp.Client(profile="mpi")
rc.wait_for_engines(64)
eall = rc.broadcast_view(coalescing=True)
root = rc[0]
len(rc)
64
root['a'] = 5  # sanity check: dict-style assignment pushes a value to the root engine
%px from mpi4py.MPI import COMM_WORLD as MPI

We can get a mapping of IPython Parallel engine id to MPI rank, in case they don't match.

In recent-enough IPython Parallel they usually do match, because IPython engines request their MPI rank as their engine id.

mpi_ranks = eall.apply_async(lambda : MPI.Get_rank()).get_dict()
root_rank = root.apply_sync(lambda : MPI.Get_rank())
mpi_ranks
{0: 0,
 1: 1,
 2: 2,
 3: 3,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 10: 10,
 11: 11,
 12: 12,
 13: 13,
 14: 14,
 15: 15,
 16: 16,
 17: 17,
 18: 18,
 19: 19,
 20: 20,
 21: 21,
 22: 22,
 23: 23,
 24: 24,
 25: 25,
 26: 26,
 27: 27,
 28: 28,
 29: 29,
 30: 30,
 31: 31,
 32: 32,
 33: 33,
 34: 34,
 35: 35,
 36: 36,
 37: 37,
 38: 38,
 39: 39,
 40: 40,
 41: 41,
 42: 42,
 43: 43,
 44: 44,
 45: 45,
 46: 46,
 47: 47,
 48: 48,
 49: 49,
 50: 50,
 51: 51,
 52: 52,
 53: 53,
 54: 54,
 55: 55,
 56: 56,
 57: 57,
 58: 58,
 59: 59,
 60: 60,
 61: 61,
 62: 62,
 63: 63}
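If the mapping were not the identity, we could invert it to find which engine holds a given MPI rank, and pick the root view accordingly (a small sketch building on the mpi_ranks dict above):

# invert: MPI rank -> IPython engine id
engine_for_rank = {mpi_rank: engine_id for engine_id, mpi_rank in mpi_ranks.items()}
# select the root view by MPI rank instead of assuming engine 0
root = rc[engine_for_rank[root_rank]]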
sz = 2048
data = np.random.random((sz, sz))
megabytes = data.nbytes // (1024 * 1024)
megabytes
32
%%time 
ar = eall.push({'data': data}, block=False)
ar.wait_interactive()
CPU times: user 285 ms, sys: 94.4 ms, total: 379 ms
Wall time: 4.49 s
@ipp.interactive
def _bcast(key, root_rank):
    """function to run on engines as part of broadcast"""
    g = globals()
    # only the root engine already has the object; everyone else contributes None
    obj = g.get(key, None)
    # the MPI broadcast from root_rank replaces None with the real object
    obj = MPI.bcast(obj, root_rank)
    g[key] = obj

def broadcast(key, obj, dv, root, root_rank):
    """More efficient broadcast by doing a push to the root engine,
    followed by an MPI broadcast from there to the other engines.

    Still O(N) messages, but all but one of them are small.
    """
    # one big message: client -> root engine
    root.push({key: obj}, block=False)
    # N small messages asking every engine to join the MPI bcast
    return dv.apply_async(_bcast, key, root_rank)
%%time
ar = broadcast('data', data, eall, root, root_rank)
ar.wait_interactive()
CPU times: user 252 ms, sys: 58.3 ms, total: 310 ms
Wall time: 939 ms
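That is roughly a 5x speedup for the same 32 MB payload: only the client-to-root message carries the data, and the fan-out happens over MPI. For large arrays we could go further and avoid pickling entirely with mpi4py's buffer-based Bcast; the following is a hedged sketch (it assumes the value is a contiguous NumPy array, and _bcast_buffer is a hypothetical variant, not part of the example above):

@ipp.interactive
def _bcast_buffer(key, root_rank, shape, dtype_str):
    """Broadcast a NumPy array via the buffer interface (no pickle)."""
    import numpy as np
    g = globals()
    arr = g.get(key)
    if arr is None:
        # non-root engines allocate an empty receive buffer first
        arr = np.empty(shape, dtype=np.dtype(dtype_str))
    # uppercase Bcast transfers the raw buffer in place
    MPI.Bcast(arr, root=root_rank)
    g[key] = arr

root.push({'data': data}, block=False)
ar = eall.apply_async(_bcast_buffer, 'data', root_rank, data.shape, data.dtype.str)
ar.wait_interactive()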

And we can quickly check that every engine got the same data by computing its norm:

%%px
import numpy as np
np.linalg.norm(data, 2)
Out[0:2]: 255.8632587551305
Out[1:2]: 255.8632587551305
Out[2:2]: 255.8632587551305
Out[3:2]: 255.8632587551305
Out[4:2]: 255.8632587551305
Out[5:2]: 255.8632587551305
Out[6:2]: 255.8632587551305
Out[7:2]: 255.8632587551305
Out[8:2]: 255.8632587551305
Out[9:2]: 255.8632587551305
Out[10:2]: 255.8632587551305
Out[11:2]: 255.8632587551305
Out[12:2]: 255.8632587551305
Out[13:2]: 255.8632587551305
Out[14:2]: 255.8632587551305
Out[15:2]: 255.8632587551305
Out[16:2]: 255.8632587551305
Out[17:2]: 255.8632587551305
Out[18:2]: 255.8632587551305
Out[19:2]: 255.8632587551305
Out[20:2]: 255.8632587551305
Out[21:2]: 255.8632587551305
Out[22:2]: 255.8632587551305
Out[23:2]: 255.8632587551305
Out[24:2]: 255.8632587551305
Out[25:2]: 255.8632587551305
Out[26:2]: 255.8632587551305
Out[27:2]: 255.8632587551305
Out[28:2]: 255.8632587551305
Out[29:2]: 255.8632587551305
Out[30:2]: 255.8632587551305
Out[31:2]: 255.8632587551305
Out[32:2]: 255.8632587551305
Out[33:2]: 255.8632587551305
Out[34:2]: 255.8632587551305
Out[35:2]: 255.8632587551305
Out[36:2]: 255.8632587551305
Out[37:2]: 255.8632587551305
Out[38:2]: 255.8632587551305
Out[39:2]: 255.8632587551305
Out[40:2]: 255.8632587551305
Out[41:2]: 255.8632587551305
Out[42:2]: 255.8632587551305
Out[43:2]: 255.8632587551305
Out[44:2]: 255.8632587551305
Out[45:2]: 255.8632587551305
Out[46:2]: 255.8632587551305
Out[47:2]: 255.8632587551305
Out[48:2]: 255.8632587551305
Out[49:2]: 255.8632587551305
Out[50:2]: 255.8632587551305
Out[51:2]: 255.8632587551305
Out[52:2]: 255.8632587551305
Out[53:2]: 255.8632587551305
Out[54:2]: 255.8632587551305
Out[55:2]: 255.8632587551305
Out[56:2]: 255.8632587551305
Out[57:2]: 255.8632587551305
Out[58:2]: 255.8632587551305
Out[59:2]: 255.8632587551305
Out[60:2]: 255.8632587551305
Out[61:2]: 255.8632587551305
Out[62:2]: 255.8632587551305
Out[63:2]: 255.8632587551305
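Scanning 64 identical numbers by eye doesn't scale; an alternative check (a sketch, not from the original run) is to let MPI itself assert agreement with an allreduce:

%%px
from mpi4py.MPI import MAX, MIN
local = float(np.linalg.norm(data, 2))
# max == min across ranks means every engine computed the same norm
assert MPI.allreduce(local, op=MAX) == MPI.allreduce(local, op=MIN)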