Laboratoire de Probabilités Statistique et Modélisation (LPSM), UMR 8001, Sorbonne University, 4 Place Jussieu, Paris, 75005, France.
Assistance Publique-Hôpitaux de Paris, Biomedical Informatics and Public Health Department, European Georges Pompidou Hospital, 20 Rue Leblanc, Paris, 75015, France.
BMC Med Res Methodol. 2019 Mar 6;19(1):50. doi: 10.1186/s12874-019-0673-4.
Choosing the best-performing method for outcome prediction or variable selection is a recurring problem in prognosis studies, and it has led to many publications comparing methods. Some aspects, however, have received little attention. First, most comparison studies treat prediction performance and variable selection separately. Second, methods are compared either within a binary outcome setting (where the aim is to predict whether readmission will occur within an arbitrarily chosen delay) or within a survival analysis setting (where the outcomes are the censored times themselves), but not both. In this paper, we propose a comparison methodology that weighs up these settings in terms of both prediction and variable selection, while incorporating advanced machine learning strategies.
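To make the difference between the two settings concrete, here is a minimal sketch of how censored times might be turned into binary labels for a chosen delay. The function name `binarize_outcome` and the convention of returning `None` for patients censored before the delay are illustrative assumptions, not something the paper specifies.

```python
from typing import List, Optional, Sequence


def binarize_outcome(
    times: Sequence[float],
    events: Sequence[int],
    delay: float,
) -> List[Optional[int]]:
    """Derive binary labels from censored times (hypothetical helper).

    times  : observed time for each patient (event or censoring time)
    events : 1 if the event (e.g. readmission) was observed, 0 if censored
    delay  : arbitrarily chosen threshold defining the binary outcome

    Returns 1 if the event was observed within the delay, 0 if the patient
    was still event-free after the delay, and None if the patient was
    censored at or before the delay (the label is then unknown).
    """
    labels: List[Optional[int]] = []
    for t, e in zip(times, events):
        if e and t <= delay:
            labels.append(1)      # event observed within the delay
        elif t > delay:
            labels.append(0)      # followed past the delay without an event
        else:
            labels.append(None)   # censored too early: outcome unknown
    return labels


labels = binarize_outcome(times=[10, 40, 5, 30], events=[1, 1, 0, 0], delay=30)
```

The `None` entries show what the binary setting gives up: patients censored before the delay carry no usable label, whereas a survival model can still use their partial follow-up.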
Using a high-dimensional case study on a sickle-cell disease (SCD) cohort, we compare 8 statistical methods. In the binary outcome setting, we consider logistic regression (LR), support vector machine (SVM), random forest (RF), gradient boosting (GB) and neural network (NN); in the survival analysis setting, we consider the Cox Proportional Hazards (PH), the CURE and the C-mix models. We also propose a method using Gaussian Processes to extract meaningful structured covariates from longitudinal data.
Among all assessed statistical methods, the survival analysis methods obtain the best results. In particular, the C-mix model yields the best performance in both settings considered (AUC = 0.94 in the binary outcome setting), as well as interesting interpretation aspects. There is some consistency in the selected covariates across methods within a setting, but not much across the two settings.
It appears that learning within the survival analysis setting first (thus using all the temporal information), and then going back to a binary prediction using the survival estimates, gives significantly better prediction performance than that obtained by models trained "directly" within the binary outcome setting.
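The step of "going back to a binary prediction using the survival estimates" can be sketched as follows: each patient's estimated survival function is evaluated at the chosen delay, and 1 - S(delay) is used as the binary score. This is only a sketch under stated assumptions. A per-patient exponential survival model stands in for the Cox PH, CURE and C-mix models of the study, and the hazard rates, labels and function names are made up for illustration.

```python
import math
from typing import Sequence


def event_probability(hazard_rate: float, delay: float) -> float:
    """P(T <= delay) under an assumed exponential model S(t) = exp(-rate * t)."""
    return 1.0 - math.exp(-hazard_rate * delay)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-patient hazard rates, as if produced by a fitted survival model.
hazard_rates = [0.10, 0.01, 0.05, 0.002]
# Hypothetical binary outcomes: event within the delay (1) or not (0).
labels = [1, 0, 1, 0]
delay = 30.0

# Survival estimates are converted into binary-setting scores at the delay.
scores = [event_probability(rate, delay) for rate in hazard_rates]
result = auc(scores, labels)
```

Because the scores come from a model fitted on the full censored times, they can be compared on the same AUC metric as classifiers trained directly on the binary labels.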