Optimizing Alzheimer's disease prediction through ensemble learning and feature interpretability with SHAP-based feature analysis.

Authors

Hossain Md Kamrul, Ashraf Afrina, Islam Md Mominul, Sourav Shoriful Hassan, Shimul Md Monir Hossain

Affiliations

Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh.

Department of Public Health, Daffodil International University, Dhaka, Bangladesh.

Publication information

Alzheimers Dement (Amst). 2025 Aug 8;17(3):e70162. doi: 10.1002/dad2.70162. eCollection 2025 Jul-Sep.

DOI: 10.1002/dad2.70162

PMID: 40787633

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12333869/

Abstract

INTRODUCTION

Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia. Early diagnosis is vital. We developed an interpretable machine learning (ML) model for early AD prediction using open clinical data.

METHODS

Data from 2149 adults aged 60-90 years were obtained from Kaggle. After preprocessing and feature engineering, tree-based models were trained, and a stacking ensemble combining Gradient Boosting and XGBoost was built with Logistic Regression as the meta-learner. SHapley Additive exPlanations (SHAP) provided interpretability. Performance was measured by accuracy, precision, recall, F1 score, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC).

RESULTS

The stacked ensemble achieved 97% accuracy (AUC 0.97), with 0.97 precision, 0.94 recall, and 0.96 F1 score for AD. SHAP identified memory complaints, Mini-Mental State Examination (MMSE), functional assessment, behavioral symptoms, cholesterol, and lifestyle factors (activity, diet, sleep) as top predictors.
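For readers who want to see how these metrics relate to one another, the sketch below computes them from a binary confusion matrix. The counts are hypothetical, chosen only so that the rounded accuracy, precision, and recall match the figures quoted above; they are not taken from the study. The helper names `classification_metrics` and `harmonic_f1` are likewise illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def harmonic_f1(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion-matrix counts (assumption), not data from the study.
acc, prec, rec, f1 = classification_metrics(tp=94, fp=3, fn=6, tn=197)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} F1={f1:.3f}")

# F1 recomputed from the rounded precision and recall quoted in the abstract.
print(f"F1 from rounded inputs: {harmonic_f1(0.97, 0.94):.3f}")
```

Recomputing F1 from the rounded precision and recall gives roughly 0.955, so the reported 0.96 presumably reflects unrounded values; a two-decimal difference of this size is expected from rounding alone.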

CONCLUSION

The ensemble model, enhanced by SHAP analysis, provides accurate and interpretable AD risk predictions with potential applicability in future clinical decision support systems.
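The attribution idea behind SHAP can be illustrated with an exact Shapley value computation on a toy model. The sketch below is an assumption-laden illustration, not the study's analysis and not the `shap` library: the risk function `toy_risk`, its coefficients, the three feature names, and the instance and baseline values are all hypothetical. Features absent from a coalition are set to a single baseline value, which is one common simplification.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    Features outside a coalition take their baseline value.
    """
    n = len(instance)

    def value(coalition):
        mixed = [instance[j] if j in coalition else baseline[j] for j in range(n)]
        return predict(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi += weight * (value(coalition | {i}) - value(coalition))
        phis.append(phi)
    return phis


def toy_risk(x):
    """Hypothetical risk score with one interaction term (not a fitted model)."""
    memory_complaint, mmse, sleep_quality = x
    return (
        0.4 * memory_complaint
        + 0.02 * (30 - mmse)
        + 0.1 * memory_complaint * (1 - sleep_quality)
    )


names = ["memory_complaint", "mmse", "sleep_quality"]  # hypothetical features
instance = [1, 20, 0.2]   # person being explained (made-up values)
baseline = [0, 28, 0.8]   # reference person (made-up values)

phis = shapley_values(toy_risk, instance, baseline)
for name, phi in zip(names, phis):
    print(f"{name}: {phi:+.3f}")
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that gives SHAP its name. Exact enumeration scales exponentially with the number of features, so practical tools rely on model-specific or sampling-based algorithms instead.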

HIGHLIGHTS

Developed an ensemble machine learning (ML) model for early Alzheimer's disease (AD) prediction.
Achieved 97% accuracy using stacked XGBoost and Gradient Boosting.
SHapley Additive exPlanations (SHAP) analysis identified key cognitive and lifestyle-related risk factors.
Model interprets AD risk using explainable artificial intelligence (AI) for clinical applicability.
Utilized open-access dataset to ensure reproducibility and transparency.
