Department of Environmental Health, Harvard T. H. Chan School of Public Health, 401 Park Drive West, Boston, MA, 02215, USA.
Department of Biostatistics, Harvard T. H. Chan School of Public Health, 655 Huntington Avenue, Boston, MA, 02115, USA.
Biostatistics. 2021 Apr 10;22(2):381-401. doi: 10.1093/biostatistics/kxz036.
We propose a computationally and statistically efficient divide-and-conquer (DAC) algorithm to fit sparse Cox regression to massive datasets where the sample size $n_0$ is exceedingly large and the covariate dimension $p$ is not small but $n_0\gg p$. The proposed algorithm achieves computational efficiency through a one-step linear approximation followed by a least squares approximation to the partial likelihood (PL). This sequence of linearizations enables us to maximize the PL using only a small subset of the data and to perform penalized estimation via a fast approximation to the PL. The algorithm is applicable to the analysis of both time-independent and time-dependent survival data. Simulations suggest that the proposed DAC algorithm is substantially faster than both the full sample-based estimators and the existing DAC algorithm, while achieving statistical efficiency similar to that of the full sample-based estimators. The proposed algorithm was applied to extraordinarily large survival datasets for the prediction of heart failure-specific readmission within 30 days among Medicare heart failure patients.
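The three stages named in the abstract (an initial PL fit on a small subset, a one-step linear update aggregated across subsets, and penalized estimation on a least squares approximation to the PL) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the simulated data, the assumption of no tied event times, and the plain lasso penalty solved by coordinate descent are all assumptions made here for illustration, since the abstract does not specify the penalty or the solver.

```python
import numpy as np

def cox_score_info(X, time, event, beta):
    """Score and observed information of the Cox log PL (assumes no tied times)."""
    order = np.argsort(-time)              # descending time: each risk set is a prefix
    X, event = X[order], event[order]
    w = np.exp(X @ beta)
    S0 = np.cumsum(w)
    S1 = np.cumsum(w[:, None] * X, axis=0)
    S2 = np.cumsum(w[:, None, None] * X[:, :, None] * X[:, None, :], axis=0)
    xbar = S1 / S0[:, None]                # risk-set weighted covariate means
    d = event.astype(bool)
    score = (X[d] - xbar[d]).sum(axis=0)
    info = (S2[d] / S0[d][:, None, None]).sum(axis=0) - xbar[d].T @ xbar[d]
    return score, info

def newton_cox(X, time, event, n_iter=25, tol=1e-8):
    """Maximize the PL on one (small) subset by plain Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        score, info = cox_score_info(X, time, event, beta)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def dac_one_step(subsets, beta_init):
    """One-step update: sum each subset's score and information at beta_init."""
    p = beta_init.size
    U, A = np.zeros(p), np.zeros((p, p))
    for X, time, event in subsets:
        s, i = cox_score_info(X, time, event, beta_init)
        U += s
        A += i
    return beta_init + np.linalg.solve(A, U), A

def lsa_lasso(beta_tilde, A, lam, n_sweeps=200):
    """Coordinate descent for 0.5*(b - bt)' A (b - bt) + lam * ||b||_1."""
    b = beta_tilde.copy()
    for _ in range(n_sweeps):
        for j in range(b.size):
            c = A[j, j] * b[j] - A[j] @ (b - beta_tilde)
            b[j] = np.sign(c) * max(abs(c) - lam, 0.0) / A[j, j]   # soft threshold
    return b

# Simulated example (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n0, p, K = 4000, 6, 8
beta_true = np.array([0.8, -0.6, 0.0, 0.0, 0.5, 0.0])
X = rng.standard_normal((n0, p))
T = rng.exponential(1.0, n0) / np.exp(X @ beta_true)   # event times
C = rng.exponential(1.0 / 0.3, n0)                     # censoring times (rate 0.3)
time, event = np.minimum(T, C), (T <= C).astype(int)

subsets = [(X[i], time[i], event[i]) for i in np.array_split(np.arange(n0), K)]
beta_init = newton_cox(*subsets[0])                    # PL maximized on one subset only
beta_tilde, info_total = dac_one_step(subsets, beta_init)
beta_sparse = lsa_lasso(beta_tilde, info_total / n0, lam=0.1)
print(np.round(beta_tilde, 3), np.round(beta_sparse, 3))
```

Only the first subset is ever passed to the iterative PL solver; every other subset contributes a single pass to compute its score and information, and the penalized step works on a $p \times p$ quadratic form rather than on the data. That is the source of the speedup in this sketch, under the assumptions stated above.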