Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea.
Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh 700000, Vietnam.
Int J Environ Res Public Health. 2020 Sep 7;17(18):6513. doi: 10.3390/ijerph17186513.
Smoking-induced noncommunicable diseases (SiNCDs) have become a significant threat to public health and a major cause of death worldwide. Over the last decade, numerous studies have used artificial intelligence techniques to predict the risk of developing SiNCDs. However, determining the most significant features and developing interpretable models remain challenging in such systems. In this study, we propose an efficient extreme gradient boosting (XGBoost) based framework combined with a hybrid feature selection (HFS) method for SiNCDs prediction among the general population in South Korea and the United States. HFS is performed in three stages: (I) significant features are selected by t-test and chi-square test; (II) multicollinearity analysis is used to retain dissimilar features; (III) the final set of most representative features is selected with the least absolute shrinkage and selection operator (LASSO). The selected features are then fed into the XGBoost predictive model. The experimental results show that the proposed model outperforms several existing baseline models. In addition, the proposed model reports feature importance, which enhances the interpretability of SiNCDs prediction. Consequently, the XGBoost based framework is expected to contribute to the early diagnosis and prevention of SiNCDs in public health.
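The three HFS stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hybrid_feature_selection`, the thresholds (`p_max`, `corr_max`, `lasso_alpha`), the use of pairwise Pearson correlation as the multicollinearity check, and the synthetic data are all hypothetical choices made for the example. The abstract does not specify which multicollinearity measure or LASSO settings were used.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso


def hybrid_feature_selection(X, y, categorical, p_max=0.05,
                             corr_max=0.8, lasso_alpha=0.01):
    """Return the column indices of X that survive all three stages.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    # Stage I: univariate tests against the binary outcome
    # (chi-square for categorical columns, Welch t-test otherwise).
    stage1 = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if j in categorical:
            table = np.array([[np.sum((col == c) & (y == k)) for k in (0, 1)]
                              for c in np.unique(col)])
            p = stats.chi2_contingency(table)[1]
        else:
            p = stats.ttest_ind(col[y == 1], col[y == 0],
                                equal_var=False).pvalue
        if p < p_max:
            stage1.append(j)

    # Stage II: greedy multicollinearity filter. A feature is kept only if
    # its absolute Pearson correlation with every feature kept so far is
    # below corr_max (assumed stand-in for the paper's analysis).
    corr = np.abs(np.corrcoef(X, rowvar=False))
    stage2 = []
    for j in stage1:
        if all(corr[j, k] < corr_max for k in stage2):
            stage2.append(j)
    if not stage2:
        return []

    # Stage III: LASSO on standardized features; keep nonzero coefficients.
    Z = X[:, stage2]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    coef = Lasso(alpha=lasso_alpha).fit(Z, y).coef_
    return [j for j, w in zip(stage2, coef) if w != 0.0]


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
n = 400
x0 = rng.normal(size=n)                        # informative, continuous
x1 = x0 + 0.05 * rng.normal(size=n)            # near-duplicate of x0
x2 = rng.normal(size=n)                        # noise, continuous
x3 = rng.integers(0, 2, size=n).astype(float)  # informative, categorical
x4 = rng.integers(0, 3, size=n).astype(float)  # noise, categorical
y = (x0 + 1.5 * x3 + 0.5 * rng.normal(size=n) > 0.75).astype(int)
X = np.column_stack([x0, x1, x2, x3, x4])

selected = hybrid_feature_selection(X, y, categorical={3, 4})
print(selected)
```

In a pipeline like the one the abstract describes, the surviving columns `X[:, selected]` would then be passed to a gradient boosting classifier such as `XGBClassifier` from the `xgboost` package. The filter stages run first because they are cheap and reduce the number of features that LASSO has to weigh against each other.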