Machine learning-enabled risk prediction of chronic obstructive pulmonary disease with unbalanced data.

Author information

Wang Xuchun, Ren Hao, Ren Jiahui, Song Wenzhu, Qiao Yuchao, Ren Zeping, Zhao Ying, Linghu Liqin, Cui Yu, Zhao Zhiyang, Chen Limin, Qiu Lixia

Affiliations

Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China.

Shanxi Centre for Disease Control and Prevention, Taiyuan, Shanxi 030012, China.

Publication information

Comput Methods Programs Biomed. 2023 Mar;230:107340. doi: 10.1016/j.cmpb.2023.107340. Epub 2023 Jan 6.

Abstract

BACKGROUND AND OBJECTIVE

Because the early symptoms of chronic obstructive pulmonary disease (COPD) are not obvious, patients are not easily identified, and the window for timely prevention and treatment is often missed. In the present study, machine learning (ML) methods were employed to construct a risk prediction model for COPD and improve prediction efficiency.

METHODS

We collected data from a sample of 5807 cases with a complete COPD diagnosis from the 2019 COPD Surveillance Program in Shanxi Province and extracted 34 potentially relevant variables from the dataset. First, we used three feature selection methods (generalized elastic net, lasso and adaptive lasso) to select ten variables. Next, to build supervised classifiers for class-imbalanced data, we combined cost-sensitive learning and SMOTE resampling, respectively, with each of the ML methods (Logistic Regression, SVM, Random Forest, XGBoost, LightGBM, NGBoost and Stacking). Finally, we assessed the performance of the resulting models.
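To make the resampling step concrete, the following is a minimal sketch of the interpolation idea behind SMOTE, written with the Python standard library only. It is not the study's implementation: the function name `smote`, its parameters and the toy data are assumptions made for illustration, and a real analysis would more likely rely on an established library.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between neighbours.

    `minority` is a list of numeric feature vectors from the minority class.
    All names and defaults here are illustrative assumptions.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (hypothetical values, not study data)
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(toy_minority, n_synthetic=6, k=2)
```

Each synthetic point lies on a line segment between two existing minority samples, so oversampling enlarges the minority class without duplicating records exactly. Cost-sensitive learning takes the other route named above: the data are left unchanged and misclassified minority cases are weighted more heavily in the loss.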

RESULTS

Frequent cough at or before age 14 and nine other variables were significant predictors of COPD. The Stacking heterogeneous ensemble model performed relatively well on the unbalanced dataset. On the balanced data, Logistic Regression with class weighting had the best classification performance when the composite indicators (AUC, F1-Score and G-mean) were used as criteria for model comparison. The F1-Score/G-mean values for the top three ML models were 0.290/0.660 for Logistic Regression with class weighting, 0.288/0.649 for Stacking with synthetic minority oversampling technique (SMOTE), and 0.285/0.648 for LightGBM with SMOTE.
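For readers less familiar with the two threshold-based indicators, the sketch below computes F1-Score and G-mean from confusion-matrix counts using their standard definitions (F1 is the harmonic mean of precision and recall; G-mean is the geometric mean of sensitivity and specificity). The function name and the counts are hypothetical and chosen only for illustration; they are not data from the study.

```python
import math


def f1_and_gmean(tp, fp, fn, tn):
    """Return (F1-Score, G-mean) from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Hypothetical counts for an imbalanced test set: 100 positives, 900 negatives
f1, gmean = f1_and_gmean(tp=60, fp=270, fn=40, tn=630)
print(f"F1 = {f1:.3f}, G-mean = {gmean:.3f}")  # roughly 0.279 and 0.648
```

With a rare positive class, even a modest false-positive rate yields many more false positives than true positives, which lowers precision and therefore F1, while G-mean depends only on the per-class recall rates. This is one reason an F1-Score near 0.29 can sit alongside a G-mean near 0.65.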

CONCLUSIONS

This study combined feature selection methods, unbalanced-data processing methods and machine learning methods with data from disease surveillance questionnaires and physical measurements to identify people at risk of COPD. It concluded that machine learning models based on survey questionnaires could automatically identify people at risk of COPD and provide a simple, scientific aid for its early identification.