Accelerating Machine Learning Inference with GPUs in ProtoDUNE Data Processing

Author Information

Tejin Cai, Kenneth Herner, Tingjun Yang, Michael Wang, Maria Acosta Flechas, Philip Harris, Burt Holzman, Kevin Pedro, Nhan Tran

Affiliations

Department of Physics and Astronomy, York University, 4700 Keele Street, Toronto, ON M3J 1P3, Canada.

Fermi National Accelerator Laboratory, Kirk Road and Pine Streets, Batavia, IL 60510, USA.

Publication Information

Comput Softw Big Sci. 2023;7(1):11. doi: 10.1007/s41781-023-00101-0. Epub 2023 Oct 27.

Abstract

We study the performance of a cloud-based GPU-accelerated inference server to speed up event reconstruction in neutrino data batch jobs. Using detector data from the ProtoDUNE experiment and employing the standard DUNE grid job submission tools, we attempt to reprocess the data by running several thousand concurrent grid jobs, a rate we expect to be typical of current and future neutrino physics experiments. We process most of the dataset with the GPU version of our processing algorithm and the remainder with the CPU version for timing comparisons. We find that a 100-GPU cloud-based server is able to easily meet the processing demand, and that the GPU version of the event processing algorithm is twice as fast as the CPU version, even when compared with the newest CPUs in our sample. The amount of data transferred to the inference server during the GPU runs can overwhelm even the highest-bandwidth network switches, however, unless care is taken to observe network facility limits or otherwise distribute the jobs to multiple sites. We discuss the lessons learned from this processing campaign and several avenues for future improvements.
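
The inference-as-a-service pattern the abstract describes — batch jobs shipping neural-network inputs over the network to a shared GPU server rather than running the model locally — is commonly implemented with the NVIDIA Triton Inference Server. Below is a minimal client-side sketch of such a remote inference call in Python. The server URL, the model name (emtrkmichel), and the tensor names and shapes are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a remote GPU inference call in the style the
# abstract describes: the grid job only moves tensors over the
# network; the model runs on the shared GPU server.
# Assumptions (hypothetical, not from the paper): a Triton server at
# TRITON_URL, a model named "emtrkmichel" that scores 48x48 detector
# patches, and the input/output tensor names below.
import numpy as np
import tritonclient.grpc as grpcclient

TRITON_URL = "inference.example.org:8001"  # hypothetical endpoint

client = grpcclient.InferenceServerClient(url=TRITON_URL)

# A batch of 256 single-channel 48x48 patches (dummy data for the sketch).
batch = np.random.rand(256, 48, 48, 1).astype(np.float32)

infer_input = grpcclient.InferInput("main_input", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)
requested = grpcclient.InferRequestedOutput("softmax_output")

# One round trip to the GPU server. Because every request carries the
# full input tensors, aggregate network bandwidth becomes the limiting
# resource at scale, as the abstract notes.
result = client.infer(model_name="emtrkmichel",
                      inputs=[infer_input],
                      outputs=[requested])
scores = result.as_numpy("softmax_output")
print(scores.shape)
```

Batching many patches into a single request, as in this sketch, amortizes the per-request overhead; the trade-off is larger individual transfers, which is one reason jobs may need to be spread across multiple network sites.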


Fig. 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f30f/10611601/69c3a3522080/41781_2023_101_Fig1_HTML.jpg
