West Virginia Clinical and Translational Science Institute, Morgantown, WV 26506, USA.
Department of Biostatistics and Epidemiology, College of Public Health, East Tennessee State University, Johnson City, TN 37614, USA.
Int J Environ Res Public Health. 2024 Nov 6;21(11):1474. doi: 10.3390/ijerph21111474.
Feature selection is the process of picking informative and relevant features from a larger set of candidates. Few studies have examined predictors of current e-cigarette use among U.S. adults using feature selection and machine learning (ML) approaches. This study aimed to perform feature selection and develop ML models to predict current e-cigarette use using the 2022 Health Information National Trends Survey (HINTS 6). The Boruta algorithm and the least absolute shrinkage and selection operator (LASSO) were used to perform feature selection on 71 variables. The Random Over-Sampling Examples (ROSE) method was used to deal with imbalanced data. Five ML algorithms, namely support vector machines (SVMs), logistic regression (LR), random forest (RF), gradient boosting machine (GBM), and extreme gradient boosting (XGBoost), were applied to develop the models. The overall prevalence of current e-cigarette use was 4.3%. Using the 15 variables selected by both Boruta and LASSO, the RF algorithm provided the best classifier, with an accuracy of 0.992, sensitivity of 0.985, F1 score of 0.991, and AUC of 0.999. Weighted logistic regression further confirmed that age, education level, smoking status, belief in the harm of e-cigarette use, binge drinking, belief that alcohol increases cancer risk, and the Patient Health Questionnaire-4 (PHQ4) score were associated with e-cigarette use. This study confirmed the strength of ML techniques for survey data, and the findings will guide inquiry into the behaviors and mentalities of substance users.
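To illustrate why class imbalance matters at a prevalence near 4.3%, the following is a minimal Python sketch, not the study's pipeline. It uses toy labels (43 positives out of 1000, an assumption chosen to mirror the reported prevalence) and placeholder features. Note that it balances classes by plain random duplication of minority rows, which is a simpler technique than ROSE (ROSE generates synthetic examples with a smoothed bootstrap). The function names `random_oversample` and `binary_metrics` are hypothetical helpers introduced here for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size.

    This is plain random oversampling, NOT the ROSE smoothed bootstrap.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall), precision, and F1 from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Toy data (assumption): 43 users among 1000 respondents, about 4.3%.
labels = [1] * 43 + [0] * 957
rows = [[i] for i in range(1000)]  # placeholder feature vectors

# A classifier that always predicts "non-user" looks accurate but finds no users.
baseline = binary_metrics(labels, [0] * len(labels))

# After oversampling, both classes have the same number of rows.
bal_rows, bal_labels = random_oversample(rows, labels)
balanced_counts = Counter(bal_labels)
```

On the toy labels, the always-negative baseline reaches high accuracy while its sensitivity and F1 are zero, which is why sensitivity, F1, and AUC are reported alongside accuracy for imbalanced outcomes. When resampling is combined with a held-out evaluation, it is usually applied to the training split only, so that duplicated rows do not appear in both training and test data.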