Department of Statistics, University of Connecticut, Storrs, Connecticut, USA.
Stat Med. 2024 Oct 30;43(24):4650-4666. doi: 10.1002/sim.10200. Epub 2024 Aug 20.
Subsampling is a practical strategy for analyzing massive survival data, which are increasingly encountered across diverse research domains. While the optimal subsampling method has been applied to inference for Cox models and parametric accelerated failure time (AFT) models, its application to semi-parametric AFT models with rank-based estimation has received limited attention. The challenges arise from the non-smooth estimating function for the regression coefficients and from the seemingly zero contribution of censored observations to the estimating functions in their commonly used form. To address these challenges, we develop optimal subsampling probabilities for both event and censored observations by expressing the estimating functions through a well-defined stochastic process. Meanwhile, we apply an induced smoothing procedure to the non-smooth estimating functions. As the optimal subsampling probabilities depend on the unknown regression coefficients, we employ a two-step procedure to obtain a feasible estimation method. An additional benefit of the method is its ability to resolve the underestimation of the variance that occurs when the subsample size approaches the full sample size. We validate the performance of our estimators through a simulation study and apply the methods to analyze the survival time of lymphoma patients in the Surveillance, Epidemiology, and End Results (SEER) program.
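The two-step procedure and the induced smoothing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the function names (`pair_terms`, `smoothed_gehan`, `fit`, `standin_scores`, `two_step`) are hypothetical, the Gehan weight is used for the rank-based estimating function, the smoothing scale uses an identity working covariance, and the second-stage sampling score is a simple stand-in (the norm of each observation's averaged pairwise contribution, counting both roles it plays in a pair) rather than the optimal probabilities derived in the paper. The stand-in does show why a censored observation need not contribute zero: it still enters pairs as the comparison partner of an event.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # NumPy has no erf; vectorize the stdlib one


def pair_terms(beta, Xa, ya, Xb, yb, m):
    """Pairwise quantities between rows a (index i) and rows b (index j)."""
    dX = Xa[:, None, :] - Xb[None, :, :]                       # x_i - x_j
    d = (yb - Xb @ beta)[None, :] - (ya - Xa @ beta)[:, None]  # e_j - e_i
    r = np.sqrt((dX ** 2).sum(-1) / m) + 1e-12                 # smoothing scale (assumed form)
    z = d / r
    Phi = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))               # standard normal CDF
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)     # standard normal pdf
    return dX, d, r, Phi, phi


def smoothed_gehan(beta, X, y, delta, w):
    """Weighted induced-smoothed Gehan loss with its gradient and Hessian."""
    dX, d, r, Phi, phi = pair_terms(beta, X, y, X, y, len(y))
    W = (w * delta)[:, None] * w[None, :]       # pair weight w_i * w_j, events only as i
    W = W / W.sum()
    loss = (W * (d * Phi + r * phi)).sum()      # smooth version of max(e_j - e_i, 0)
    grad = np.einsum("ij,ijk->k", W * Phi, dX)  # smoothed estimating function
    hess = np.einsum("ij,ijk,ijl->kl", W * phi / r, dX, dX)
    return loss, grad, hess


def fit(X, y, delta, w, n_iter=25, tol=1e-6):
    """Damped Newton iterations on the convex smoothed loss."""
    beta = np.linalg.lstsq(X - X.mean(0), y - y.mean(), rcond=None)[0]  # crude start
    loss, g, H = smoothed_gehan(beta, X, y, delta, w)
    for _ in range(n_iter):
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(H + 1e-10 * np.eye(len(beta)), g)
        t = 1.0
        while t > 1e-6:                          # halve the step until the loss does not increase
            new = smoothed_gehan(beta - t * step, X, y, delta, w)
            if new[0] <= loss:
                break
            t /= 2.0
        beta = beta - t * step
        loss, g, H = new
    return beta


def standin_scores(beta, X, y, delta, Xp, yp, dp):
    """Illustrative sampling score (NOT the paper's optimal probabilities)."""
    dX, _, _, Phi, _ = pair_terms(beta, X, y, Xp, yp, len(yp))
    # Role as i (needs delta_i) minus role as j (needs the partner's delta).
    coef = delta[:, None] * Phi - dp[None, :] * (1.0 - Phi)
    return np.linalg.norm((coef[..., None] * dX).mean(axis=1), axis=1)


def two_step(X, y, delta, r0, r1, rng, alpha=0.2):
    """Step 1: uniform pilot fit. Step 2: informed subsample, weighted fit."""
    n = len(y)
    idx0 = rng.choice(n, size=r0, replace=True)
    beta0 = fit(X[idx0], y[idx0], delta[idx0], np.ones(r0))
    s = standin_scores(beta0, X, y, delta, X[idx0], y[idx0], delta[idx0])
    pi = (1.0 - alpha) * s / s.sum() + alpha / n   # mix with uniform to bound the weights
    idx1 = rng.choice(n, size=r1, replace=True, p=pi)
    w = 1.0 / pi[idx1]                             # inverse-probability weights
    return fit(X[idx1], y[idx1], delta[idx1], w / w.mean()), pi
```

Here `y` is the log of the observed time, `delta` is the event indicator stored as floats, and `X` carries no intercept column, since the rank-based estimating function depends only on differences of residuals. The sketch fits the final estimate on the second-stage subsample alone; it does not implement the paper's variance estimation.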