Kutzner Carsten, van der Spoel David, Fechner Martin, Lindahl Erik, Schmitt Udo W, de Groot Bert L, Grubmüller Helmut
Department of Theoretical and Computational Biophysics, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany.
J Comput Chem. 2007 Sep;28(12):2075-84. doi: 10.1002/jcc.20703.
We investigate the parallel scaling of the GROMACS molecular dynamics code on Ethernet Beowulf clusters and the prerequisites for decent scaling even on such clusters with only limited bandwidth and high latency. GROMACS 3.3 scales well on supercomputers like the IBM p690 (Regatta) and on Linux clusters with a special interconnect like Myrinet or Infiniband. Because of the high single-node performance of GROMACS, however, on the widely used Ethernet-switched clusters the scaling typically breaks down as soon as more than two computer nodes are involved, limiting the absolute speedup to about 3 relative to a single-CPU run. With the LAM MPI implementation, the main scaling bottleneck is identified to be the all-to-all communication required in every time step. During such an all-to-all communication step, a huge number of messages floods the network, and as a result many TCP packets are lost. We show that Ethernet flow control prevents network congestion and leads to substantial scaling improvements; for 16 CPUs, for example, a speedup of 11 has been achieved. For more nodes, however, this mechanism also fails. With an optimized all-to-all routine that sends the data in an ordered fashion, we show that packet loss can be prevented completely for any number of multi-CPU nodes. Thus, the GROMACS scaling improves dramatically, even for switches that lack flow control. In addition, for the common HP ProCurve 2848 switch we find that how the nodes are connected to the switch's ports is essential for optimum all-to-all performance. This is also demonstrated for the example of the Car-Parrinello MD code.
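The ordered all-to-all described in the abstract can be illustrated with a pairwise-exchange schedule in which, during each phase, every rank sends to exactly one partner and receives from exactly one partner, so no switch port is flooded with simultaneous messages. The following C/MPI sketch is only an illustration of that idea, not the routine implemented in GROMACS; the phase schedule, buffer layout, and the name ordered_alltoall are assumptions.

```c
/* ordered_alltoall.c -- hypothetical sketch of an ordered all-to-all.
 * In each of the P phases every rank is paired with exactly one send
 * partner and one receive partner, so messages traverse the switch in
 * an ordered fashion instead of all at once (the idea behind the
 * paper's routine; the exact schedule in GROMACS may differ). */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Exchange 'count' ints with every other rank, one partner per phase. */
static void ordered_alltoall(const int *sendbuf, int *recvbuf,
                             int count, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    for (int phase = 0; phase < nprocs; phase++) {
        int dst = (rank + phase) % nprocs;           /* send target    */
        int src = (rank - phase + nprocs) % nprocs;  /* receive source */

        if (dst == rank) {
            /* phase 0 is the self-exchange: just copy locally */
            memcpy(recvbuf + rank * count, sendbuf + rank * count,
                   count * sizeof(int));
        } else {
            /* paired send/receive, one partner per phase */
            MPI_Sendrecv(sendbuf + dst * count, count, MPI_INT, dst, 0,
                         recvbuf + src * count, count, MPI_INT, src, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 1024;  /* ints destined for each rank (dummy size) */
    int *sendbuf = malloc(nprocs * count * sizeof(int));
    int *recvbuf = malloc(nprocs * count * sizeof(int));
    for (int i = 0; i < nprocs * count; i++)
        sendbuf[i] = rank;   /* dummy payload */

    ordered_alltoall(sendbuf, recvbuf, count, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and run with mpirun, a schedule like this leaves each node with at most one outstanding send and one outstanding receive per phase, which is the property the paper relies on to avoid TCP packet loss on Ethernet switches without flow control.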