Liu Shiqi, Xie Zilong, Zheng Ming, Yu Wen
Department of Statistics and Data Science, School of Management, Fudan University, Shanghai, People's Republic of China.
School of Mathematical Sciences, Fudan University, Shanghai, People's Republic of China.
J Appl Stat. 2024 Nov 4;52(7):1315-1341. doi: 10.1080/02664763.2024.2423234. eCollection 2025.
Subsampling designs are useful for reducing the computational load and storage cost of large-scale data analysis. For massive survival data with right censoring, we propose a class of optimal subsampling designs under the widely used Cox model. The proposed designs utilize information from both the outcome and the covariates. Different forms of the design can be derived adaptively to meet various targets, such as optimizing the overall estimation accuracy or minimizing the variance of a specific linear combination of the estimators. Given the subsampled data, the inverse probability weighting approach is employed to estimate the model parameters. The resultant estimators are shown to be consistent and asymptotically normally distributed. Simulation results indicate that, for subsamples of comparable size, the proposed subsampling design yields more efficient estimators than uniform subsampling. Additionally, subsampling estimation significantly reduces the computational load and storage cost relative to full-data estimation. An analysis of a real data example is provided for illustration.
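The inverse probability weighting step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the simulated data, the Poisson (independent inclusion) sampling scheme, the hypothetical inclusion probabilities that simply favor observed events, and the function names `weighted_cox_fit` and `run_demo`. The paper's optimal subsampling probabilities are not given in the abstract and are not reproduced here. The sketch only shows how weights 1/pi_i enter a weighted Cox partial likelihood, maximized by Newton-Raphson, assuming no tied event times.

```python
import numpy as np


def weighted_cox_fit(time, event, X, w, n_iter=25, tol=1e-8):
    """Maximize a weighted Cox log partial likelihood by Newton-Raphson.

    Assumes no tied times. `w` holds inverse inclusion probabilities
    (use all ones for an unweighted full-data fit).
    """
    order = np.argsort(-time)  # descending time: cumulative sums give risk sets
    event, X, w = event[order], X[order], w[order]
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        r = w * np.exp(X @ beta)                       # weighted relative risks
        S0 = np.cumsum(r)                              # sum over each risk set
        S1 = np.cumsum(r[:, None] * X, axis=0)
        S2 = np.cumsum(r[:, None, None] * X[:, :, None] * X[:, None, :], axis=0)
        xbar = S1 / S0[:, None]                        # weighted risk-set mean of X
        d = w * event                                  # only events contribute
        score = (d[:, None] * (X - xbar)).sum(axis=0)
        cov = S2 / S0[:, None, None] - xbar[:, :, None] * xbar[:, None, :]
        info = (d[:, None, None] * cov).sum(axis=0)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def run_demo(seed=0, n=20000, m=2000):
    """Simulate right-censored Cox data, then compare full and subsample fits."""
    rng = np.random.default_rng(seed)
    beta_true = np.array([0.5, -0.5])
    X = rng.normal(size=(n, 2))
    T = rng.exponential(scale=1.0 / np.exp(X @ beta_true))  # event times
    C = rng.exponential(scale=1.0, size=n)                  # censoring times
    time = np.minimum(T, C)
    event = (T <= C).astype(float)

    # Hypothetical non-uniform design: observed events are three times as
    # likely to be kept as censored observations (NOT the paper's design).
    s = 1.0 + 2.0 * event
    pi = np.minimum(1.0, m * s / s.sum())   # expected subsample size is about m
    keep = rng.random(n) < pi               # Poisson sampling, no duplicates

    beta_full = weighted_cox_fit(time, event, X, np.ones(n))
    beta_sub = weighted_cox_fit(time[keep], event[keep], X[keep], 1.0 / pi[keep])
    return beta_true, beta_full, beta_sub, int(keep.sum())


if __name__ == "__main__":
    beta_true, beta_full, beta_sub, size = run_demo()
    print("true:", beta_true)
    print("full data:", beta_full)
    print("IPW subsample (size %d):" % size, beta_sub)
```

Poisson sampling is used in this sketch so that no observation appears twice in the subsample, which keeps the no-ties assumption intact. Replacing the hypothetical `pi` with uniform probabilities `m / n` gives the uniform-subsampling baseline that the abstract compares against.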