Multivariate longitudinal data for survival analysis of cardiovascular event prediction in young adults: insights from a comparative explainable study.

Author information

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.

Department of Cardiology, Johns Hopkins University, Baltimore, MD, USA.

Publication information

BMC Med Res Methodol. 2023 Jan 25;23(1):23. doi: 10.1186/s12874-023-01845-4.

Abstract

BACKGROUND

Multivariate longitudinal data are under-utilized for survival analysis compared to cross-sectional (CS) data, i.e., data collected once across a cohort. Particularly in cardiovascular risk prediction, despite the availability of methods for longitudinal data analysis, the value of longitudinal information has not been established in terms of improved predictive accuracy and clinical applicability.

METHODS

We investigated the value of longitudinal data over and above the use of cross-sectional data via 6 distinct modeling strategies from statistics, machine learning, and deep learning that incorporate repeated measures for survival analysis of the time-to-cardiovascular event in the Coronary Artery Risk Development in Young Adults (CARDIA) cohort. We then examined and compared the use of model-specific interpretability methods (Random Survival Forest Variable Importance) and model-agnostic methods (SHapley Additive exPlanation (SHAP) and Temporal Importance Model Explanation (TIME)) in cardiovascular risk prediction using the top-performing models.
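As a rough illustration of how repeated measures can be turned into inputs for a survival model, the sketch below condenses one participant's trajectory for a single variable into a few summary features. The function name `summarize_trajectory`, the chosen features (mean, slope, last value, range), the evenly spaced visit years, and the blood pressure values are all hypothetical and chosen for illustration; this is not the authors' pipeline or one of the six strategies as they implemented them.

```python
from statistics import mean


def summarize_trajectory(visit_years, values):
    """Condense one participant's repeated measurements of a single
    variable into a few hand-picked summary features (illustrative only)."""
    t_bar = mean(visit_years)
    v_bar = mean(values)
    # Ordinary least-squares slope of value against time.
    sxx = sum((t - t_bar) ** 2 for t in visit_years)
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(visit_years, values))
    return {
        "mean": v_bar,                       # average level across visits
        "slope": sxy / sxx,                  # average change per year
        "last": values[-1],                  # last observed (cross-sectional) value
        "range": max(values) - min(values),  # spread over follow-up
    }


# Hypothetical participant: six visits over 15 years, made-up systolic
# blood pressure readings (mmHg).
visit_years = [0, 3, 6, 9, 12, 15]
systolic_bp = [110, 113, 116, 119, 122, 125]
features = summarize_trajectory(visit_years, systolic_bp)
print(features)
```

A model given only `last` sees the cross-sectional view, while one that also receives `mean`, `slope`, and `range` sees how the variable evolved, which is the kind of contrast the study evaluates.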

RESULTS

In a cohort of 3539 participants, longitudinal information from 35 variables that were repeatedly collected in 6 exam visits over 15 years improved subsequent long-term (17 years after) risk prediction by up to 8.3% in C-index relative to using baseline data (0.78 vs. 0.72), and by up to approximately 4% relative to using the last observed CS data (0.75). Time-varying AUC was also higher in models using longitudinal data (0.86-0.87 at 5 years, 0.79-0.81 at 10 years) than in models using baseline or last observed CS data (0.80-0.86 at 5 years, 0.73-0.77 at 10 years). Comparative model interpretability analysis revealed the impact of longitudinal variables on model prediction at both the individual and global scales across the different modeling strategies, and identified the best time windows and the best timing within each window for event prediction. The most accurate strategy for incorporating longitudinal data was time series massive feature extraction, and the most easily interpreted strategy was trajectory clustering.
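The percentage gains quoted above are relative changes in C-index, which can be checked directly. The sketch below, under stated assumptions, shows a plain-Python version of Harrell's concordance index together with that arithmetic. The helper names `harrell_c_index` and `relative_gain` and the four-person toy data are hypothetical; the study's own evaluation code is not described in the abstract.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C-index: among comparable pairs, the fraction in which the
    participant with the earlier observed event has the higher risk score.
    Tied risk scores count as half; tied times are not treated as comparable."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


def relative_gain(new, reference):
    """Relative improvement of `new` over `reference`, as a fraction."""
    return (new - reference) / reference


# Toy data (made up): follow-up time in years, event indicator, model risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.3, 0.5]
print(harrell_c_index(times, events, risk))  # 4 of 5 comparable pairs concordant

# Arithmetic behind the reported gains, using the C-index values in the abstract.
print(round(100 * relative_gain(0.78, 0.72), 1))  # vs. baseline data
print(round(100 * relative_gain(0.78, 0.75), 1))  # vs. last observed CS data
```

With the reported values, the first gain works out to about 8.3% and the second to 4.0%, matching the figures in the text.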

CONCLUSION

Our analysis demonstrates the added value of longitudinal data in predictive accuracy and epidemiological utility in cardiovascular risk survival analysis in young adults via a unified, scalable framework that compares model performance and explainability. The framework can be extended to a larger number of variables and other longitudinal modeling methods.

TRIAL REGISTRATION

ClinicalTrials.gov Identifier: NCT00005130, Registration Date: 26/05/2000.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c63/9878947/037dd07b0a0b/12874_2023_1845_Fig1_HTML.jpg
