Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei 106, Taiwan.
Med Phys. 2011 Jul;38(7):4052-65. doi: 10.1118/1.3591994.
Iterative reconstruction techniques hold great potential to mitigate the effects of data noise and/or incompleteness, and hence can facilitate patient dose reduction. However, their long reconstruction times make them unsuitable for routine clinical practice. In this work, the authors accelerate the computations by taking full advantage of the highly parallel computational power of single and multiple graphics processing units (GPUs). In particular, the forward projection algorithm, which is not part of the closed-form reconstruction formulas, is accelerated and optimized on the GPU.
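To illustrate why forward projection dominates the cost of iterative methods, the following minimal sketch runs the standard multiplicative MLEM update on a toy system. It is an assumption-laden illustration, not the authors' implementation: the dense list-of-lists system matrix `A`, the function names, and the toy numbers are all hypothetical, and the authors' EM formulation may differ. Each iteration needs one forward projection and one backprojection, whereas a closed-form method such as FDK needs only filtering and a single backprojection.

```python
def forward_project(A, x):
    """Compute A x: one ray sum per detector bin."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def back_project(A, r):
    """Compute A^T r: distribute per-ray values back onto the voxels."""
    n_vox = len(A[0])
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(n_vox)]


def mlem(A, y, n_iter=50, eps=1e-12):
    """Run n_iter standard MLEM updates starting from a uniform image."""
    x = [1.0] * len(A[0])
    sens = back_project(A, [1.0] * len(A))  # sensitivity image A^T 1
    for _ in range(n_iter):
        proj = forward_project(A, x)  # forward projection of current estimate
        ratio = [y_i / max(p_i, eps) for y_i, p_i in zip(y, proj)]
        corr = back_project(A, ratio)  # backprojection of measured/estimated ratio
        x = [x_j * c_j / max(s_j, eps) for x_j, c_j, s_j in zip(x, corr, sens)]
    return x


# Toy system (hypothetical numbers): 3 rays through 2 voxels.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_true = [2.0, 3.0]
y = forward_project(A, x_true)  # noiseless measurements
x_est = mlem(A, y, n_iter=200)
print(x_est)
```

With noiseless, consistent data and a full-rank toy matrix, the estimate approaches `x_true`; in a realistic cone-beam setting the two projector calls per iteration are the part that a GPU would accelerate.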
The main contribution is a novel forward projection algorithm that uses multiple threads to handle the computations associated with a bundle of adjacent rays simultaneously. The proposed algorithm is free of divergence and bank conflicts on the GPU, and benefits from data locality and data reuse. It achieves its efficiency by (i) employing a tiled algorithm with three-level parallelization, (ii) optimizing the thread block size, (iii) maximizing data reuse in constant memory and shared memory, and (iv) exploiting the built-in interpolation capability of texture memory. In addition, to accelerate the iterative algorithms and the Feldkamp-Davis-Kress (FDK) algorithm on the GPU, the authors apply a batched fast Fourier transform (FFT) to expedite the filtering process in FDK and use projection bundling parallelism during backprojection to shorten the execution times of FDK and the expectation-maximization (EM) algorithm.
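The ray-driven idea can be sketched on the CPU in two dimensions. This is only a conceptual sketch under stated assumptions: it is 2D rather than cone-beam, it uses equidistant midpoint sampling, and the names `bilinear`, `ray_sum`, and `forward_project_bundle` are hypothetical. The software `bilinear` function stands in for the interpolation that GPU texture memory provides in hardware, and the loop over adjacent detector points stands in for the bundle of rays that the proposed algorithm assigns to cooperating threads.

```python
import math


def bilinear(img, x, y):
    """Sample img at continuous (x, y); return zero outside the pixel grid."""
    h, w = len(img), len(img[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = min(int(x), w - 2), min(int(y), h - 2)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x0 + 1] * fx
    bot = img[y0 + 1][x0] * (1 - fx) + img[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bot * fy


def ray_sum(img, src, dst, step=0.25):
    """Approximate the line integral from src to dst by midpoint sampling."""
    (sx, sy), (dx, dy) = src, dst
    length = math.hypot(dx - sx, dy - sy)
    n = max(1, int(round(length / step)))
    dl = length / n  # actual spacing between samples
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n
        total += bilinear(img, sx + t * (dx - sx), sy + t * (dy - sy))
    return total * dl


def forward_project_bundle(img, src, detector_pts, step=0.25):
    """Project one source position onto a bundle of adjacent detector points.

    On a GPU, rays and samples could be spread over threads; here it is a loop.
    """
    return [ray_sum(img, src, d, step) for d in detector_pts]


# Uniform 5 x 5 image of ones; pixel centers sit at integer coordinates 0..4.
img = [[1.0] * 5 for _ in range(5)]
bundle = forward_project_bundle(img, (-2.0, 2.0), [(6.0, 1.0), (6.0, 2.0), (6.0, 3.0)])
print(bundle)
```

Adjacent rays in a bundle sample nearby pixels, which is the data locality the abstract refers to; how the authors tile the volume and size the thread blocks is not specified here and is not reproduced in this sketch.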
Numerical experiments conducted on an NVIDIA Tesla C1060 GPU demonstrated the computational time savings of the proposed algorithms. The forward projection, filtering, and backprojection times for generating a 512 x 512 x 512 volume image from 360 projections of 512 x 512 pixels using one GPU are about 4.13, 0.65, and 2.47 s (including distance weighting), respectively. In particular, the proposed forward projection algorithm is ray-driven, and its parallelization strategy evolves from single-thread-for-single-ray (38.56 s), through multithreads-for-single-ray (26.05 s), to multithreads-for-multirays (4.13 s). For the voxel-driven backprojection, the use of texture memory reduces the reconstruction time from 4.95 to 3.35 s. Applying the projection bundle technique further reduces the computation time to 2.47 s. When multiple GPUs were employed, near-perfect speedups were observed as the number of GPUs increased. For example, with four GPUs, the times for forward projection, filtering, and backprojection are further reduced to 1.11, 0.18, and 0.66 s, respectively. The results obtained by the GPU-based algorithms are virtually indistinguishable from those obtained on the CPU.
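The reported timings translate into speedup factors with simple arithmetic. The short script below only restates numbers from the abstract; the dictionary layout and the `speedup` helper are illustrative choices, and parallel efficiency is taken here to mean speedup divided by the number of GPUs.

```python
# Timings in seconds, copied from the abstract (Tesla C1060, 512^3 volume, 360 views).
forward_variants = {
    "single-thread-for-single-ray": 38.56,
    "multithreads-for-single-ray": 26.05,
    "multithreads-for-multirays": 4.13,
}
one_gpu = {"forward projection": 4.13, "filtering": 0.65, "backprojection": 2.47}
four_gpu = {"forward projection": 1.11, "filtering": 0.18, "backprojection": 0.66}


def speedup(t_ref, t_new):
    """Return how many times faster t_new is than t_ref."""
    return t_ref / t_new


# Gain of each forward projection strategy over the simplest one.
baseline = forward_variants["single-thread-for-single-ray"]
for name, t in forward_variants.items():
    print(f"{name}: {speedup(baseline, t):.2f}x vs. baseline")

# Multi-GPU scaling: speedup and efficiency (speedup / number of GPUs).
n_gpus = 4
for stage in one_gpu:
    s = speedup(one_gpu[stage], four_gpu[stage])
    print(f"{stage}: {s:.2f}x on {n_gpus} GPUs, efficiency {s / n_gpus:.0%}")
```

The multithreads-for-multirays strategy is about 9.3 times faster than single-thread-for-single-ray, and the four-GPU runs reach speedups of roughly 3.6 to 3.7, which corresponds to about 90 to 94 percent parallel efficiency.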
The authors have proposed a highly optimized GPU-based forward projection algorithm, as well as GPU-based FDK and expectation-maximization reconstruction algorithms. The authors' compute unified device architecture (CUDA) codes provide exceedingly fast forward projection and backprojection that outperform implementations using shading languages, the Cell Broadband Engine Architecture, and previous CUDA implementations. The reconstruction times of the FDK and EM algorithms were considerably shortened, which can facilitate their routine use in a variety of applications such as image quality improvement and dose reduction.