Wang Liqin, Zhang Shijia, Gao Zhaohong, Jiang Deyou
Heilongjiang University of Chinese Medicine, 24 Heping Road, Xiangfang District, Harbin, 150040, Heilongjiang, China.
The First Affiliated Hospital of Heilongjiang University of Chinese Medicine, 24 Heping Road, Xiangfang District, Harbin, 150040, Heilongjiang, China.
BMC Pulm Med. 2025 Jul 3;25(1):317. doi: 10.1186/s12890-025-03776-w.
Chronic obstructive pulmonary disease (COPD) is a major global public health concern, and early screening and identification of high-risk populations are critical for reducing the disease burden. Although several studies have explored the application of machine learning methods in COPD risk prediction, existing models often have limited feature dimensions and insufficient interpretability. Identifying key risk factors and constructing reliable predictive models remain challenges in clinical practice.
This study aims to integrate multidimensional features from the National Health and Nutrition Examination Survey (NHANES) and to compare the performance of different machine learning models in COPD risk prediction. The goal is to identify the best-performing model and to improve its clinical applicability through interpretability analysis.
This study used NHANES data collected between 2009 and 2018. After systematic feature selection and preprocessing, three models were developed: multivariate binary logistic regression, XGBoost, and a Multilayer Perceptron (MLP). Models were trained and evaluated using stratified five-fold cross-validation. Performance was assessed by accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve (AUC). To improve model transparency, the SHapley Additive Explanations (SHAP) method was used to interpret the key features of the MLP model and the direction of their influence.
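The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data are synthetic (generated with `make_classification` as a stand-in for the NHANES feature matrix), and the scaler, hidden-layer size, iteration limit, and random seeds are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the study data (assumption: about 10% positives).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Scaling lives inside the pipeline so it is refit on each training fold only.
# The architecture below is hypothetical; the abstract does not report one.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)

# Stratified five-fold split: each fold keeps roughly the same class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

scores = cross_validate(
    model, X, y, cv=cv, scoring=scoring, return_train_score=True
)

# Average each metric over the five folds, for held-out and training data.
mean_test = {m: float(np.mean(scores[f"test_{m}"])) for m in scoring}
mean_train = {m: float(np.mean(scores[f"train_{m}"])) for m in scoring}

for m in scoring:
    print(f"{m:>9}: train={mean_train[m]:.3f}  test={mean_test[m]:.3f}")
```

Comparing the `train` and `test` columns is one simple way to look for the kind of overfitting gap the abstract refers to; with imbalanced classes, accuracy alone can look high even when precision and recall are modest.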
The MLP model demonstrated the best performance across all evaluation metrics, achieving an average accuracy of 0.937, precision of 0.6624, recall of 0.6535, and F1 score of 0.657 in stratified five-fold cross-validation. The performance gap between the training and testing sets was minimal, indicating no obvious overfitting. SHAP analysis identified smoking years, asthma, age, dietary health status, total protein, red cell distribution width (RDW), BMI, marital status, secondhand smoke exposure, and total bilirubin as important predictive features. Furthermore, dependence plots revealed critical risk inflection points for key continuous variables.
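As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall. The sketch below applies that definition to the averaged values quoted above; the function and variable names are illustrative. Because the reported F1 is presumably averaged per fold, it need not equal the F1 computed from the averaged precision and recall exactly.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Averages reported for the MLP model in the abstract.
reported_precision = 0.6624
reported_recall = 0.6535
reported_f1 = 0.657

# F1 implied by the averaged precision and recall.
f1_of_means = f1_from(reported_precision, reported_recall)
gap = abs(f1_of_means - reported_f1)

print(f"F1 from averaged precision/recall: {f1_of_means:.4f}")
print(f"Difference from reported F1:       {gap:.4f}")
```

The implied value is approximately 0.658, within about 0.001 of the reported 0.657, which is consistent with fold-wise averaging and rounding.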
Based on large-scale and multidimensional feature data, this study constructed a COPD risk prediction model with favorable performance and enhanced interpretability. The findings suggest that the MLP model has the potential to effectively identify individuals at high risk for COPD and may offer value in clinical applications. Future studies are warranted to integrate longitudinal follow-up data and multimodal information to further improve predictive accuracy and clinical interpretability, thereby providing a more robust foundation for early screening and personalized interventions in COPD.