Teodoro George, Pan Tony, Kurc Tahsin, Kong Jun, Cooper Lee, Saltz Joel
Center for Comprehensive Informatics and Biomedical Informatics Department, Emory University, Atlanta, GA 30322.
Parallel Comput. 2013 Apr 1;39(4-5):189-211. doi: 10.1016/j.parco.2013.03.001.
We address the problem of efficiently executing a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements that receive the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU-accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and Euclidean distance transform. Our results show significant performance improvements on GPUs. Cooperative use of multiple CPUs and GPUs attains speedups of 50× and 85× over single-core CPU execution for morphological reconstruction and Euclidean distance transform, respectively.
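To make the propagation pattern described in the abstract concrete, the following is a minimal sequential sketch of the IWPP applied to grayscale morphological reconstruction. It is an illustration under stated assumptions, not the authors' implementation: it uses a single FIFO queue on one CPU core rather than the multi-level GPU queue or the tile-based scheme, it assumes 4-connectivity, and it seeds the wavefront with every pixel for simplicity. The function name `morph_reconstruct` and its parameters are hypothetical.

```python
from collections import deque


def morph_reconstruct(marker, mask):
    """Grayscale morphological reconstruction by dilation (4-connected).

    Hypothetical helper: a sequential, single-queue sketch of the
    wavefront propagation pattern, not the paper's GPU implementation.
    """
    rows, cols = len(mask), len(mask[0])
    # The marker image must lie under the mask, so clamp it first.
    out = [[min(marker[r][c], mask[r][c]) for c in range(cols)]
           for r in range(rows)]
    # Assumption: seed the wavefront with every pixel. This is correct
    # but wasteful; a tighter seeding would enqueue fewer elements.
    wavefront = deque((r, c) for r in range(rows) for c in range(cols))
    while wavefront:
        r, c = wavefront.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            # Propagation condition: the neighbor is lower than the
            # current element and still has room to grow under the mask.
            if out[nr][nc] < out[r][c] and out[nr][nc] < mask[nr][nc]:
                out[nr][nc] = min(out[r][c], mask[nr][nc])
                # The neighbor received a wave, so it joins the wavefront.
                wavefront.append((nr, nc))
    return out


mask = [[5, 5, 0, 5],
        [5, 5, 0, 5]]
marker = [[3, 0, 0, 0],
          [0, 0, 0, 0]]
result = morph_reconstruct(marker, mask)
```

In this example the value 3 spreads through the left 2×2 region, while the zero-valued mask column blocks it from reaching the right column. Which elements enter the queue, and in what order, depends on the image content, which is the source of the irregular data accesses mentioned above.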