Brunn Malte, Himthani Naveen, Biros George, Mehl Miriam, Mang Andreas
Computer Science, University of Stuttgart, Stuttgart, DE.
Oden Institute, University of Texas, Austin TX, US.
Int Conf High Perform Comput Netw Storage Anal. 2020 Nov;2020. doi: 10.1109/sc41405.2020.00042.
We present a Gauss-Newton-Krylov solver for large deformation diffeomorphic image registration. We extend the publicly available CLAIRE library to multi-node, multi-GPU (graphics processing unit) systems and introduce novel algorithmic modifications that significantly improve performance. Our contributions comprise (i) a new preconditioner for the reduced-space Gauss-Newton Hessian system, (ii) a highly optimized multi-node multi-GPU implementation that exploits device direct communication for the main computational kernels (interpolation, high-order finite difference operators, and fast Fourier transforms), and (iii) a comparison with state-of-the-art CPU and GPU implementations. We solve an image registration problem at a resolution of 256³ in five seconds on a single NVIDIA Tesla V100, a speedup of 70% over the state of the art. In our largest run, we register images at a resolution of 2048³ (25 B unknowns; approximately 152× larger than the largest problem solved by state-of-the-art GPU implementations) on 64 nodes with 256 GPUs on TACC's Longhorn system.
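The problem sizes quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that the unknown is a velocity field with three components per grid point on a cubic grid; the abstract does not state this, but it is consistent with the quoted figure of 25 B unknowns. The function names `num_unknowns` and `implied_grid_size` are hypothetical helpers introduced only for this illustration.

```python
def num_unknowns(n: int, components: int = 3) -> int:
    """Unknowns on an n x n x n grid with `components` values per grid point.

    Assumption: three velocity components per grid point (not stated in the abstract).
    """
    return components * n ** 3


def implied_grid_size(n: int, ratio: float) -> int:
    """Grid points per dimension of a cubic problem that is `ratio` times smaller.

    Assumption: both problems store the same number of unknowns per grid point.
    """
    return round(n / ratio ** (1.0 / 3.0))


if __name__ == "__main__":
    print(f"{num_unknowns(256):,}")      # single-GPU problem
    print(f"{num_unknowns(2048):,}")     # largest run, roughly 25 B
    print(implied_grid_size(2048, 152))  # grid size implied by the 152x ratio
```

Under these assumptions, the 2048³ run has 3 × 2048³ = 25,769,803,776 unknowns, and the 152× ratio would correspond to a previous largest problem of roughly 384 grid points per dimension. That last figure is inferred from the ratio alone and is not taken from the abstract.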