Kumar Sameer, Heidelberger Philip, Chen Dong, Hines Michael
IBM T.J. Watson Research Center, Yorktown Heights, NY 10598.
Proc IPDPS (Conf). 2010 Apr 19;2010:1-11. doi: 10.1109/IPDPS.2010.5470407.
We explore the multisend interface as a data-mover interface for optimizing applications with neighborhood collective communication operations. One limitation of the current MPI 2.1 standard is that the vector collective calls require counts and displacements (including zero-byte entries) to be specified for all the processors in the communicator. Further, all collective calls in MPI 2.1 are blocking and do not permit overlap of communication with computation. We present the record-replay persistent optimization to the multisend interface, which minimizes the processor overhead of initiating the collective. We present four case studies with the multisend API on Blue Gene/P: (i) 3D-FFT, (ii) the 4D nearest-neighbor exchange used in Quantum Chromodynamics, (iii) NAMD, and (iv) the neural network simulator NEURON. Performance results show a 1.9× speedup with 32³ 3D-FFTs, a 1.9× speedup for the 4D nearest-neighbor exchange on the 2⁴ problem, a 1.6× speedup in NAMD, and an almost 3× speedup in NEURON with 256K cells and 1K connections/cell.
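To make the MPI 2.1 limitation concrete, the sketch below (illustrative only, not code from the paper) shows a sparse neighborhood exchange expressed with the standard MPI_Alltoallv call: every rank must allocate and fill counts and displacement arrays of length P, the communicator size, even when it communicates with only a handful of neighbors, and the blocking call prevents any overlap of communication with computation. The function name and parameters are hypothetical.

```c
/* Illustrative sketch (not from the paper): a sparse neighborhood
 * exchange via MPI 2.1's MPI_Alltoallv. Metadata cost is O(P) per
 * rank even though only a few entries are nonzero, and the blocking
 * call rules out communication/computation overlap. */
#include <mpi.h>
#include <stdlib.h>

void sparse_exchange(double *sendbuf, double *recvbuf,
                     const int *neighbors, int nneighbors,
                     int msg_doubles, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);

    /* Arrays of length P are required, with zero counts for all
     * non-neighbor ranks. */
    int *sendcounts = calloc(p, sizeof(int));
    int *recvcounts = calloc(p, sizeof(int));
    int *sdispls    = calloc(p, sizeof(int));
    int *rdispls    = calloc(p, sizeof(int));

    for (int i = 0; i < nneighbors; ++i) {
        int nb = neighbors[i];
        sendcounts[nb] = recvcounts[nb] = msg_doubles;
        sdispls[nb] = rdispls[nb] = i * msg_doubles;
    }

    /* Blocking in MPI 2.1: the processor waits here and cannot
     * overlap this communication with computation. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                  recvbuf, recvcounts, rdispls, MPI_DOUBLE, comm);

    free(sendcounts); free(recvcounts);
    free(sdispls);    free(rdispls);
}
```

The paper's multisend interface, by contrast, lets the application name only its actual neighbors and, with the record-replay persistent optimization, amortize the setup cost across repeated invocations.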