IEEE Trans Ultrason Ferroelectr Freq Control. 2017 Oct;64(10):1465-1477. doi: 10.1109/TUFFC.2017.2731944. Epub 2017 Jul 25.
Simulated ultrasound data is a valuable tool for the development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that minimizing the amount of data transferred from the CPU to the GPU is crucial for good GPU performance. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without sufficient temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed a two-order-of-magnitude speedup compared to the parallel CPU version, a three-order-of-magnitude speedup compared to the simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.
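The abstract states that complete scatterer trajectories are stored as spline curves on the GPU so that each frame's scatterer positions can be evaluated on-device without new CPU-to-GPU transfers. The abstract does not specify the spline type; the sketch below illustrates the idea with a uniform Catmull-Rom spline (an assumption) evaluated in NumPy, with the control points `ctrl` being hypothetical key-time positions of one scatterer:

```python
import numpy as np

# Hypothetical control points: one scatterer's (x, y, z) position in mm,
# sampled at four uniformly spaced key times (illustrative values only).
ctrl = np.array([[0.0, 0.0, 20.0],
                 [0.5, 0.0, 20.5],
                 [1.0, 0.0, 21.5],
                 [1.5, 0.0, 22.0]])

def catmull_rom(p0, p1, p2, p3, t):
    """Evaluate a uniform Catmull-Rom spline segment at t in [0, 1].

    The curve interpolates p1 (t = 0) and p2 (t = 1); p0 and p3 only
    shape the tangents, giving a C1-continuous trajectory across segments.
    """
    return 0.5 * ((2.0 * p1)
                  + (-p0 + p2) * t
                  + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t ** 2
                  + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t ** 3)

# Position of the scatterer halfway between the middle two key times.
pos = catmull_rom(ctrl[0], ctrl[1], ctrl[2], ctrl[3], 0.5)
```

In a GPU implementation the same per-segment polynomial evaluation would run in a kernel, one thread per scatterer, so only the compact control-point table ever crosses the PCIe bus.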
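The phase-based subsample delay mentioned in the abstract addresses the flicker that arises when a scatterer's echo delay is rounded to the nearest RF sample. The paper's exact formulation is not given here; a common realization of the idea, sketched below under assumed values for the sampling rate `fs` and center frequency `f0`, splits the delay into an integer sample offset plus a fractional remainder, and applies the remainder as a phase rotation at the center frequency on a complex (IQ-like) pulse:

```python
import numpy as np

fs = 50e6   # RF sampling rate in Hz (assumed for illustration)
f0 = 2.5e6  # transducer center frequency in Hz (assumed for illustration)

def place_scatterer(rf_line, pulse, delay_s, amplitude):
    """Add one scatterer echo to a complex RF line.

    The two-way delay is split into an integer sample offset and a
    subsample remainder. The remainder is not applied by resampling the
    pulse; instead it becomes a phase rotation exp(-j*2*pi*f0*frac/fs),
    which varies smoothly with delay and so avoids frame-to-frame
    flicker from nearest-sample rounding.
    """
    delay_samples = delay_s * fs
    n0 = int(round(delay_samples))                 # integer part: sample index
    frac = delay_samples - n0                      # subsample remainder in samples
    phase = np.exp(-2j * np.pi * f0 * frac / fs)   # phase-based subsample delay
    stop = min(n0 + len(pulse), len(rf_line))
    rf_line[n0:stop] += amplitude * phase * pulse[: stop - n0]
    return rf_line

# Example: place a 4-sample pulse at a delay of 10.3 samples.
rf = place_scatterer(np.zeros(64, dtype=complex),
                     np.ones(4, dtype=complex),
                     10.3 / fs, 1.0)
```

Because the phase term is continuous in the delay, two frames whose scatterers move by a fraction of a sample produce smoothly varying echoes, which is the behavior the technique is meant to restore without extra temporal oversampling.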