Ko Seyoon, Zhou Hua, Zhou Jin J, Won Joong-Ho
Department of Biostatistics, UCLA Fielding School of Public Health, Los Angeles, California 90095, USA.
Department of Medicine, UCLA David Geffen School of Medicine, Los Angeles, California 90095, USA, and Department of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, Arizona 85724, USA.
Stat Sci. 2022 Nov;37(4):494-518. doi: 10.1214/21-sts835. Epub 2022 Oct 13.
Technological advances in the past decade, hardware and software alike, have made access to high-performance computing (HPC) easier than ever. We review these advances from a statistical computing perspective. Cloud computing makes access to supercomputers affordable. Deep learning software libraries make programming statistical algorithms easy and enable users to write code once and run it anywhere, from a laptop to a workstation with multiple graphics processing units (GPUs) or a supercomputer in a cloud. Highlighting how these developments benefit statisticians, we review recent optimization algorithms that are useful for high-dimensional models and can harness the power of HPC. Code snippets are provided to demonstrate the ease of programming. We also provide an easy-to-use distributed matrix data structure suitable for HPC. Employing this data structure, we illustrate various statistical applications including large-scale positron emission tomography and ℓ1-regularized Cox regression. Our examples easily scale up to an 8-GPU workstation and a 720-CPU-core cluster in a cloud. As a case in point, we analyze the onset of type 2 diabetes from the UK Biobank with 200,000 subjects and about 500,000 single nucleotide polymorphisms using the HPC ℓ1-regularized Cox regression. Fitting this half-million-variate model takes less than 45 minutes and reconfirms known associations. To our knowledge, this is the first demonstration of the feasibility of penalized regression of survival outcomes at this scale.
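To give a flavor of the write-once-run-anywhere programming style and the ℓ1-penalized fitting described above, the following is a minimal sketch, assuming PyTorch as the deep learning library; it is not code from the paper. A least-squares loss stands in for the Cox partial likelihood, the data are random, and the penalty strength and step size are hypothetical. Only the device string changes between a laptop CPU and a GPU.

    import torch

    # Select the compute device once; all code below is device-agnostic.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Illustrative random data: n observations, p variables (not UK Biobank data).
    n, p = 1000, 5000
    X = torch.randn(n, p, device=device)
    y = torch.randn(n, device=device)
    beta = torch.zeros(p, device=device)

    lam = 0.1    # hypothetical l1 penalty strength
    step = 1e-4  # hypothetical fixed step size

    for _ in range(100):
        # Gradient of a least-squares loss (stand-in for the Cox partial likelihood).
        grad = X.T @ (X @ beta - y)
        # Proximal gradient update: a gradient step followed by soft-thresholding,
        # the proximal operator of the l1 penalty.
        z = beta - step * grad
        beta = torch.sign(z) * torch.clamp(z.abs() - step * lam, min=0.0)

Scaling this loop to the multi-GPU and multi-node settings in the paper's examples would replace the plain tensors with the distributed matrix data structure the authors describe; that structure is not reproduced in this sketch.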