A dynamic machine learning model for prediction of NAFLD in a health checkup population: A longitudinal study.

Author information

Deng Yuhan, Ma Yuan, Fu Jingzhu, Wang Xiaona, Yu Canqing, Lv Jun, Man Sailimai, Wang Bo, Li Liming

Affiliations

Chongqing Research Institute of Big Data, Peking University, Chongqing, China.

Meinian Institute of Health, Beijing, China.

Publication information

Heliyon. 2023 Jul 27;9(8):e18758. doi: 10.1016/j.heliyon.2023.e18758. eCollection 2023 Aug.

Abstract

BACKGROUND

Non-alcoholic fatty liver disease (NAFLD) is one of the most common liver diseases worldwide. Currently, most NAFLD prediction models are diagnostic models based on cross-sectional data, which cannot provide early identification or clarify causal relationships. We aimed to use time-series deep learning models with longitudinal health checkup records to predict the future onset of NAFLD, and to update the model stepwise by incorporating new checkup records to achieve dynamic prediction.

METHODS

A total of 10,493 participants with over 6 health checkup records at the Beijing MJ Health Screening Center were included in a retrospective cohort study. Data from their first 5 checkups were incorporated stepwise, with the model updated at each step, to predict the risk of NAFLD at or after the sixth health checkup. A total of 33 variables were considered, covering demographic characteristics, medical history, lifestyle, physical examinations, and laboratory tests. L1-penalized logistic regression (LR) was used for feature selection. The long short-term memory (LSTM) algorithm was used for model development, and five-fold cross-validation was conducted to tune and select the optimal hyperparameters. Model performance was evaluated by internal validation on a randomly selected 20% holdout test dataset and by external validation on previously unseen data from the Shanghai MJ Health Screening Center. The evaluation metrics included the area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, Brier score, and decision curves. Bootstrap sampling was used to generate 95% confidence intervals for all metrics. Finally, the Shapley additive explanations (SHAP) algorithm was applied to the holdout test dataset for model interpretability, yielding time-specific and sample-specific contributions of each feature.
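As an illustration of the evaluation step described above, the following is a minimal Python sketch of AUROC, Brier score, and a bootstrap 95% confidence interval. It is not the authors' code. The abstract does not state which bootstrap variant or how many resamples were used, so the percentile method, the default of 1,000 resamples, and the function names `auroc`, `brier_score`, and `bootstrap_ci` are all assumptions made for this example.

```python
import numpy as np


def auroc(y_true, y_score):
    """AUROC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5).

    Uses an all-pairs comparison, which is simple but memory-hungry;
    intended for small illustrative samples only.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


def bootstrap_ci(metric, y_true, y_prob, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (assumed variant): resample
    participants with replacement and recompute the metric each time."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    n = y_true.size
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        # AUROC is undefined when a resample contains only one class.
        if y_true[idx].min() == y_true[idx].max():
            continue
        stats.append(metric(y_true[idx], y_prob[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Calling `bootstrap_ci(auroc, y, p)` on a vector of observed outcomes `y` and predicted risks `p` returns a `(lower, upper)` pair. Repeating the call for predictions made from 1, 2, ..., 5 checkups would give one interval per step of the dynamic model.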

RESULTS

Among the 10,493 participants, 1,662 (15.84%) were diagnosed with NAFLD at or after their sixth health checkup. The predictive performance of the deep learning model in the internal validation dataset improved as more checkups were incorporated, with AUROC increasing from 0.729 (95% CI: 0.698, 0.760) at baseline to 0.818 (95% CI: 0.798, 0.844) when 5 consecutive checkups were included. The external validation dataset, containing 1,728 participants, was used to verify the results; there, AUROC increased from 0.700 (95% CI: 0.657, 0.740) with only the first checkup to 0.792 (95% CI: 0.758, 0.825) with all five. Feature importance analysis showed that body fat percentage, alanine transaminase (ALT), and uric acid had the greatest impact on the outcome. Time-specific, individual-specific, and dynamic feature contributions were also produced for model interpretability.

CONCLUSION

Our study established a dynamic prediction model whose predictive capability kept improving as newer checkup records were incorporated. In addition, we identified key features associated with the onset of NAFLD, making it possible to optimize prevention and control strategies for the disease in the general population.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9af8/10412833/4a2b93a13359/gr1.jpg
