Grasso Ivan, Pellegrini Simone, Cosenza Biagio, Fahringer Thomas
Institute of Computer Science, University of Innsbruck, Austria ; Barcelona Supercomputing Center, Barcelona, Spain.
Institute of Computer Science, University of Innsbruck, Austria.
J Parallel Distrib Comput. 2014 Dec;74(12):3228-3239. doi: 10.1016/j.jpdc.2014.08.002.
Large-scale compute clusters of heterogeneous nodes equipped with multi-core CPUs and GPUs are becoming increasingly popular in the scientific community. However, such systems require a combination of different programming paradigms, which makes application development very challenging. In this article we introduce libWater, a library-based extension of the OpenCL programming model that simplifies the development of heterogeneous distributed applications. libWater consists of a simple interface, which is a transparent abstraction of the underlying distributed architecture and offers advanced features such as inter-context and inter-node device synchronization. It provides a runtime system that tracks the dependency information enforced by event synchronization to dynamically build a directed acyclic graph (DAG) of commands, on which we automatically apply two optimizations: collective communication pattern detection and device-host-device copy removal. We assess libWater's performance on three compute clusters available from the Vienna Scientific Cluster, the Barcelona Supercomputing Center and the University of Innsbruck, demonstrating improved performance and scaling with different test applications and configurations.
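To make the abstract's description of the runtime more concrete, the following minimal Python sketch builds a DAG from the events each command waits on and then removes a device-host-device copy. It is an illustration only, not libWater's API or implementation: the `Command` class, its fields, and the functions `build_dag` and `remove_host_round_trips` are hypothetical names, and the sketch assumes a host buffer can be bypassed whenever the read that fills it has exactly one dependent command, namely a write from the same buffer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Command:
    """One enqueued command. All field names are hypothetical."""
    name: str
    kind: str                          # "kernel", "read", "write" or "copy"
    device: str                        # device the command targets
    host_buffer: Optional[str] = None  # host staging buffer for read/write
    source: Optional[str] = None       # source device, used only by "copy"
    wait_for: list = field(default_factory=list)  # commands whose events must complete first


def build_dag(commands):
    """Map each command name to the names of the commands that wait on it."""
    successors = {c.name: [] for c in commands}
    for c in commands:
        for dep in c.wait_for:
            successors[dep].append(c.name)
    return successors


def remove_host_round_trips(commands):
    """Fuse a read (device -> host) and its dependent write (host -> device)
    into one direct copy when nothing else depends on the read."""
    by_name = {c.name: c for c in commands}
    dag = build_dag(commands)

    fused = {}  # name of the write -> the read it is fused with
    for r in commands:
        if r.kind != "read" or len(dag[r.name]) != 1:
            continue
        w = by_name[dag[r.name][0]]
        if w.kind == "write" and w.host_buffer == r.host_buffer and w.name not in fused:
            fused[w.name] = r
    dropped = {r.name for r in fused.values()}

    result = []
    for c in commands:
        if c.name in dropped:
            continue  # the read is absorbed into the fused copy
        if c.name in fused:
            r = fused[c.name]
            # The copy keeps the write's name so later commands still find it,
            # and inherits the dependencies of both original commands.
            deps = r.wait_for + [d for d in c.wait_for if d != r.name]
            c = Command(c.name, "copy", c.device, source=r.device, wait_for=deps)
        result.append(c)
    return result


commands = [
    Command("k1", "kernel", "gpu0"),
    Command("r1", "read", "gpu0", host_buffer="tmp", wait_for=["k1"]),
    Command("w1", "write", "gpu1", host_buffer="tmp", wait_for=["r1"]),
    Command("k2", "kernel", "gpu1", wait_for=["w1"]),
]
optimized = remove_host_round_trips(commands)
```

In this example the read `r1` and the write `w1` collapse into a single copy from `gpu0` to `gpu1` that waits only on `k1`. A real runtime would also have to confirm that the host program never inspects the staging buffer, which this sketch does not attempt.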