Liu Rui, Fu Lin, De Man Bruno, Yu Hengyong
Wake Forest University Health Sciences, Winston-Salem, NC 27103 USA.
General Electric Global Research, 1 Research Circle, Niskayuna, NY 12309 USA.
IEEE Trans Comput Imaging. 2017 Dec;3(4):617-632. doi: 10.1109/TCI.2017.2675705. Epub 2017 Feb 28.
Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. Distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position of voxel and detector cell boundaries. This irregular branch behavior makes the method inefficient to implement on massively parallel computing devices such as graphics processing units (GPUs). The irregular branches can be eliminated by factorizing the DD operation into three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone-beam CT. The algorithm uses texture memory and hardware interpolation on GPUs to achieve high computational speed. The developed branchless DD algorithm achieved a 137-fold speedup for forward projection and a 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. The GPU-based branchless DD method was evaluated in iterative reconstruction algorithms with both simulated and real datasets, and it produced images visually identical to those of the CPU reference algorithm.
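The three-step factorization named in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the authors' GPU implementation: the function names `dd_overlap_reference` and `dd_branchless_1d`, the uniform pixel grid (`x0`, `dx`), and the normalization by detector cell width are all assumptions made for illustration. The reference function computes overlap-weighted averages with explicit boundary comparisons. The branchless version integrates the pixel values into a running sum, linearly interpolates that sum at the detector cell boundaries (the role that texture hardware would play on a GPU), and then differences neighbouring samples.

```python
def dd_overlap_reference(x0, dx, values, det_edges):
    """Reference: overlap-weighted average per detector cell, using boundary tests."""
    out = []
    for j in range(len(det_edges) - 1):
        lo, hi = det_edges[j], det_edges[j + 1]
        acc = 0.0
        for i, v in enumerate(values):
            left = max(lo, x0 + i * dx)          # left end of the overlap
            right = min(hi, x0 + (i + 1) * dx)   # right end of the overlap
            if right > left:                     # data-dependent branch
                acc += v * (right - left)
        out.append(acc / (hi - lo))
    return out


def dd_branchless_1d(x0, dx, values, det_edges):
    """Hypothetical 1D sketch: integration, linear interpolation, differentiation."""
    n = len(values)

    # Step 1: integration. cum[i] is the integral of the pixel row up to edge i.
    cum = [0.0]
    for v in values:
        cum.append(cum[-1] + v * dx)

    # Step 2: linear interpolation of the integral at an arbitrary position.
    # Positions outside the pixel row are clamped to its ends.
    def integral_at(x):
        t = min(max((x - x0) / dx, 0.0), float(n))
        i = min(int(t), n - 1)
        return cum[i] + (t - i) * (cum[i + 1] - cum[i])

    samples = [integral_at(x) for x in det_edges]

    # Step 3: differentiation. The difference of neighbouring samples is the
    # integral over one detector cell; divide by the cell width to average.
    return [
        (samples[j + 1] - samples[j]) / (det_edges[j + 1] - det_edges[j])
        for j in range(len(det_edges) - 1)
    ]


if __name__ == "__main__":
    pixels = [1.0, 2.0, 4.0, 3.0]       # pixel values on edges 0, 1, 2, 3, 4
    detector = [0.5, 1.75, 2.5, 4.5]    # detector cell boundaries mapped to the same axis
    print(dd_overlap_reference(0.0, 1.0, pixels, detector))
    print(dd_branchless_1d(0.0, 1.0, pixels, detector))
```

In this sketch every detector cell performs the same arithmetic regardless of where its boundaries fall relative to the pixel boundaries, which is the property that makes the factorized form suitable for massively parallel hardware.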