TOD-Tree: Task-Overlapped Direct Send Tree Image Compositing for Hybrid MPI Parallelism and GPUs.

Publication information

IEEE Trans Vis Comput Graph. 2017 Jun;23(6):1677-1690. doi: 10.1109/TVCG.2016.2542069. Epub 2016 Mar 14.

Abstract

Modern supercomputers have thousands of nodes, each with CPUs and/or GPUs capable of several teraflops. However, the network connecting these nodes is relatively slow, on the order of gigabits per second. For time-critical workloads such as interactive visualization, the bottleneck is no longer computation but communication. In this paper, we present an image compositing algorithm that works on both CPU-only and GPU-accelerated supercomputers and focuses on communication avoidance and overlapping communication with computation at the expense of evenly balancing the workload. The algorithm has three stages: a parallel direct send stage, followed by a tree compositing stage and a gather stage. We compare our algorithm with radix-k and binary-swap from the IceT library in a hybrid OpenMP/MPI setting on the Stampede and Edison supercomputers, show strong scaling results and explain how we generally achieve better performance than these two algorithms. We developed a GPU-based image compositing algorithm where we use CUDA kernels for computation and GPU Direct RDMA for inter-node GPU communication. We tested the algorithm on the Piz Daint GPU-accelerated supercomputer and show that we achieve performance on par with CPUs. Last, we introduce a workflow in which both rendering and compositing are done on the GPU.
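The three stages named in the abstract (direct send, tree compositing, gather) can be illustrated with a small serial simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say how ranks are grouped, what tree arity is used, or how pixels are represented. Here the hypothetical `group_size` parameter sets how many ranks take part in each direct send, the tree is binary, pixels are premultiplied `(color, alpha)` pairs, and a lower rank is assumed to be closer to the viewer. No MPI, overlap of communication with computation, or GPU work is modeled.

```python
def over(front, back):
    """Blend two equally sized pixel lists with the 'over' operator.

    Pixels are premultiplied (color, alpha) pairs; `front` is nearer the viewer.
    """
    return [(cf + (1.0 - af) * cb, af + (1.0 - af) * ab)
            for (cf, af), (cb, ab) in zip(front, back)]


def composite_in_order(pieces):
    """Fold a front-to-back ordered list of pixel lists with `over`."""
    result = pieces[0]
    for piece in pieces[1:]:
        result = over(result, piece)
    return result


def three_stage_composite(images, group_size):
    """Serially simulate direct send, tree compositing, and gather.

    images[i] is the full image rendered by rank i (assumed depth order:
    rank 0 in front). `group_size` is an illustrative assumption.
    """
    n_ranks, n_pixels = len(images), len(images[0])
    assert n_ranks % group_size == 0 and n_pixels % group_size == 0
    region = n_pixels // group_size
    groups = [list(range(g, g + group_size))
              for g in range(0, n_ranks, group_size)]

    # Stage 1, direct send: inside each group, member j collects region j
    # from every member and blends the pieces in depth order.
    partial = []  # partial[g][j] is region j of group g's composite
    for ranks in groups:
        partial.append([
            composite_in_order(
                [images[r][j * region:(j + 1) * region] for r in ranks])
            for j in range(group_size)
        ])

    # Stage 2, tree: matching regions are merged pairwise across groups,
    # always blending the nearer group over the farther one.
    stride = 1
    while stride < len(partial):
        for g in range(0, len(partial) - stride, 2 * stride):
            partial[g] = [over(front, back)
                          for front, back in zip(partial[g], partial[g + stride])]
        stride *= 2

    # Stage 3, gather: the finished regions are concatenated into one image.
    final = []
    for piece in partial[0]:
        final.extend(piece)
    return final
```

Because the over operator is associative, the staged result should match a plain front-to-back fold of all images up to floating-point rounding, which makes `composite_in_order(images)` a convenient reference when experimenting with different values of `group_size`.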
