Warwick Clinical Trials Unit, University of Warwick, Coventry, UK.
Høyskolen Kristiania, Oslo, Norway.
BMC Med Res Methodol. 2024 Nov 29;24(1):295. doi: 10.1186/s12874-024-02390-4.
Whether machine learning approaches outperform classical statistical models in survival analysis, especially when the proportional hazards assumption does not hold, is unknown.
To compare the model performance and predictive accuracy of classical regression models and machine learning approaches using data from the Inspiring Families programme.
The Inspiring Families programme aims to support members of families with complex issues to return to work. We explored predictors of time to return to work using proportional hazards regression (semi-parametric Cox, in Stata) and flexible parametric regression (Parmar-Royston, in Stata), and compared these against penalised survival regression with an Elastic Net penalty (scikit-survival), the (conditional) Survival Forest algorithm (pySurvival), and the (kernel) Survival Support Vector Machine (pySurvival).
At baseline we obtained data on 61 binary variables from all 3161 participants. No model appeared superior, with a low predictive power (concordance index between 0.51 and 0.61). The median time to finding the first job was about 254 days. The top five contributing variables were 'family issues and additional barriers', 'restriction of hours', 'available CV', 'self-employment considered' and 'education'. Harrell's concordance index ranged from 0.60 (Cox model) to 0.71 (Random Survival Forest), suggesting a better fit for the machine learning approaches. However, the comparison of predicted median times based on a selected scenario showed only minor differences.
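For readers unfamiliar with the metric, the concordance values above can be read against a minimal sketch of Harrell's concordance index. This is an illustration only, not the implementation used in Stata, scikit-survival or pySurvival; the function name `harrell_c_index` and the toy data are hypothetical, and pairs with tied observed times are simply skipped, which is a simplifying assumption. A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Share of comparable pairs whose risk ordering matches the observed ordering.

    times       -- observed follow-up times (event or censoring)
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output where a higher score means a shorter expected time
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification (assumption): skip pairs with tied times
        # 'a' is the subject with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # the earlier subject was censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk for the earlier event: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four subjects, the second one censored at day 10.
toy_times = [5, 10, 15, 20]
toy_events = [1, 0, 1, 1]
toy_risk = [0.9, 0.4, 0.1, 0.6]

c_index = harrell_c_index(toy_times, toy_events, toy_risk)
```

In the toy data, four pairs are comparable and three of them are ranked correctly, so `c_index` is 0.75. A model that assigns every subject the same risk score would score 0.5.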
Implementing a series of survival models, with and without a proportional hazards assumption, provides useful insight as well as better interpretation of coefficients affected by non-linearities. However, that better fit does not translate into substantially higher predictive power and accuracy from using machine learning approaches. Further tuning of the machine learning algorithms may provide improved results.