Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
Department of Lymphoma/Myeloma, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
J Biomed Inform. 2022 Jul;131:104117. doi: 10.1016/j.jbi.2022.104117. Epub 2022 Jun 9.
Data analyses by machine learning (ML) algorithms are gaining popularity in biomedical research. When time-to-event data are of interest, censoring is common and needs to be properly addressed. Most ML methods cannot conveniently and appropriately take the censoring information into consideration, potentially leading to inaccurate or biased results. We aim to develop a general-purpose method for imputing censored survival data, facilitating downstream ML analysis. In this study, we propose a novel method of imputing the survival times for censored observations. The proposal is based on their conditional survival distributions (CondiS) derived from Kaplan-Meier estimators. CondiS can replace censored observations with their best approximations from the statistical model, allowing for direct application of ML methods. When covariates are available, we extend CondiS by incorporating the covariate information through ML modeling (CondiS-X), which further improves the accuracy of the imputed survival time. Compared with existing methods with similar purposes, the proposed methods achieved smaller prediction errors and higher concordance with the underlying true survival times in extensive simulation studies. We also demonstrated the usage and advantages of the proposed methods through two real-world cancer datasets. The major advantage of CondiS is that it allows for the direct application of standard ML techniques for analysis once the censored survival times are imputed. We present a user-friendly R package to implement our method, which is a useful tool for ML-based biomedical research in this era of big data.
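To make the idea of imputing from a conditional survival distribution concrete, the following is a minimal Python sketch, not the authors' R package. The abstract does not say how the "best approximation" is computed, so the sketch assumes one common choice: each censored time c is replaced by the conditional mean E[T | T > c] = c + (1/S(c)) ∫_c^τ S(t) dt, where S is the Kaplan-Meier estimate and τ is the largest observed time. The function names `kaplan_meier` and `conditional_mean_imputation` are hypothetical, and the covariate-based extension (CondiS-X) is not covered.

```python
import numpy as np


def kaplan_meier(times, events):
    """Return the distinct event times and the Kaplan-Meier survival
    estimate immediately after each of them."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)              # subjects still under observation
        deaths = np.sum((times == t) & events)    # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)


def conditional_mean_imputation(times, events):
    """Replace each censored time c with E[T | T > c] under the KM curve.

    Assumption (not taken from the paper): the mean is restricted to
    tau = max(times), and observations censored at tau or where S(c) = 0
    are left unchanged.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times, surv = kaplan_meier(times, events)
    tau = times.max()

    def S(t):
        # Right-continuous step function: value after the last event time <= t.
        idx = np.searchsorted(event_times, t, side="right")
        return 1.0 if idx == 0 else surv[idx - 1]

    imputed = times.copy()
    for i in np.where(~events)[0]:
        c = times[i]
        s_c = S(c)
        if s_c <= 0 or c >= tau:
            continue
        # S is constant between consecutive knots, so the integral is a sum
        # of rectangle areas.
        knots = np.concatenate(([c], event_times[event_times > c], [tau]))
        area = sum(S(knots[j]) * (knots[j + 1] - knots[j])
                   for j in range(len(knots) - 1))
        imputed[i] = c + area / s_c
    return imputed


if __name__ == "__main__":
    # Toy data: the second subject is censored at time 3.
    t = [2, 3, 4, 5, 8]
    d = [1, 0, 1, 1, 1]
    print(conditional_mean_imputation(t, d))
```

In the toy data, the subject censored at time 3 is assigned the average of the later event times 4, 5, and 8, because the Kaplan-Meier curve places equal conditional mass on each of them. Once imputed, the completed times can be passed to any standard regression learner as an ordinary continuous outcome, which is the workflow the abstract describes.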