TOD-Tree: Task-Overlapped Direct Send Tree Image Compositing for Hybrid MPI Parallelism and GPUs.

Publication information

IEEE Trans Vis Comput Graph. 2017 Jun;23(6):1677-1690. doi: 10.1109/TVCG.2016.2542069. Epub 2016 Mar 14.

Abstract

Modern supercomputers have thousands of nodes, each with CPUs and/or GPUs capable of several teraflops. However, the network connecting these nodes is relatively slow, on the order of gigabits per second. For time-critical workloads such as interactive visualization, the bottleneck is no longer computation but communication. In this paper, we present an image compositing algorithm that works on both CPU-only and GPU-accelerated supercomputers and focuses on communication avoidance and overlapping communication with computation at the expense of evenly balancing the workload. The algorithm has three stages: a parallel direct send stage, followed by a tree compositing stage and a gather stage. We compare our algorithm with radix-k and binary-swap from the IceT library in a hybrid OpenMP/MPI setting on the Stampede and Edison supercomputers, show strong scaling results and explain how we generally achieve better performance than these two algorithms. We developed a GPU-based image compositing algorithm where we use CUDA kernels for computation and GPU Direct RDMA for inter-node GPU communication. We tested the algorithm on the Piz Daint GPU-accelerated supercomputer and show that we achieve performance on par with CPUs. Last, we introduce a workflow in which both rendering and compositing are done on the GPU.
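
The three stages named in the abstract (direct send, tree compositing, gather) can be illustrated with a small serial simulation. This is only a sketch under stated assumptions and not the authors' implementation: ranks are simulated as list entries instead of MPI processes, images are lists of premultiplied RGBA pixels ordered front to back, the tree is binary, and the names `over`, `tree_reduce`, `composite`, and `group_size` are hypothetical. It shows the data flow only; the overlap of communication with computation that the paper focuses on is not modeled.

```python
from functools import reduce


def over(front, back):
    """Composite one premultiplied RGBA pixel over another."""
    t = 1.0 - front[3]
    return tuple(f + t * b for f, b in zip(front, back))


def over_span(front, back):
    """Apply the over operator pixel by pixel to two equal-length spans."""
    return [over(f, b) for f, b in zip(front, back)]


def tree_reduce(spans):
    """Pairwise (binary tree) reduction that preserves depth order."""
    while len(spans) > 1:
        merged = [over_span(spans[i], spans[i + 1])
                  for i in range(0, len(spans) - 1, 2)]
        if len(spans) % 2:
            merged.append(spans[-1])  # odd one out moves up unchanged
        spans = merged
    return spans[0]


def composite(images, group_size):
    """Simulate direct send, tree compositing, and gather on one process."""
    # Assumption of this sketch: the rank count is a multiple of group_size.
    assert len(images) % group_size == 0
    n = len(images[0])
    bounds = [(j * n // group_size, (j + 1) * n // group_size)
              for j in range(group_size)]
    groups = [images[i:i + group_size]
              for i in range(0, len(images), group_size)]

    # Stage 1 (direct send): inside each group, member j receives region j
    # from every member and composites those pieces front to back.
    owned = [[reduce(over_span, [img[lo:hi] for img in group])
              for lo, hi in bounds]
             for group in groups]

    # Stage 2 (tree): for each region, combine the per-group results.
    regions = [tree_reduce([owned[g][j] for g in range(len(groups))])
               for j in range(group_size)]

    # Stage 3 (gather): concatenate the finished regions on the root.
    return [px for region in regions for px in region]


def make_image(rank, n_pixels):
    """Deterministic test image with premultiplied color (hypothetical data)."""
    pixels = []
    for p in range(n_pixels):
        a = ((rank * 7 + p * 3) % 10) / 20.0
        pixels.append((a * (p % 4) / 4.0, a * 0.5, a * (rank % 3) / 3.0, a))
    return pixels


images = [make_image(rank, 12) for rank in range(8)]
result = composite(images, group_size=4)
reference = reduce(over_span, images)  # plain serial front-to-back compositing
max_error = max(abs(a - b) for p, q in zip(result, reference)
                for a, b in zip(p, q))
print("max difference from serial compositing:", max_error)
```

Because the over operator is associative, regrouping the compositing order across the stages leaves the final image unchanged up to floating-point rounding, which is what the comparison against the serial reference checks.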
