Ren Yang, Wu Dezhi, Singh Avineet, Kasson Erin, Huang Ming, Cavazos-Rehg Patricia
Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States.
Department of Integrated Information Technology, University of South Carolina, Columbia, SC, United States.
Front Big Data. 2022 Feb 10;5:770585. doi: 10.3389/fdata.2022.770585. eCollection 2022.
Regulations on the purchase and use of combustible tobacco products (i.e., cigarettes) are becoming increasingly strict; at the same time, the use of other tobacco products, including e-cigarettes (i.e., vaping products), has increased dramatically. However, public attitudes toward vaping vary widely, and the health effects of vaping remain largely unknown. As a popular social media platform, Twitter contains rich information shared by users about their behaviors and experiences, including opinions on vaping. Manually identifying vaping-related tweets to extract useful information is very challenging. In the current study, we set out to develop a detection model that accurately identifies vaping-related tweets using machine learning and deep learning methods. Specifically, we applied seven popular machine learning and deep learning algorithms, including Naïve Bayes, Support Vector Machine, Random Forest, XGBoost, Multilayer Perceptron, Transformer Neural Network, and stacking and voting ensemble models, to build our customized classification model. We extracted a set of sample tweets posted during an outbreak of e-cigarette or vaping-related lung injury (EVALI) in 2019 and created an annotated corpus to train and evaluate these models. After comparing the performance of each model, we found that the stacking ensemble achieved the highest performance, with an F1-score of 0.97. All models achieved 0.90 or higher after hyperparameter tuning. The ensemble learning model had the best average performance. Our findings provide informative guidelines and practical implications for the automated detection of themed social media data for public opinion and health surveillance purposes.
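To make the evaluation terms concrete, the following is a minimal sketch of hard (majority) voting across several classifiers, scored with the F1-score. It is not the authors' implementation: the function names `hard_vote` and `f1_score`, the three "models", and all labels are hypothetical toy values chosen for illustration. It also shows the voting ensemble only; the stacking ensemble reported as the best performer would instead train a second-level model on the base models' outputs.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Combine per-model label lists by majority vote, one tweet at a time."""
    voted = []
    for labels in zip(*predictions_per_model):
        # With an odd number of models and binary labels, no ties can occur.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = vaping-related tweet, 0 = unrelated tweet.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
model_a = [1, 1, 0, 0, 0, 1, 1, 0]
model_b = [1, 0, 1, 0, 0, 1, 0, 0]
model_c = [1, 1, 1, 0, 1, 0, 0, 0]

ensemble = hard_vote([model_a, model_b, model_c])

for name, preds in [("model_a", model_a), ("model_b", model_b),
                    ("model_c", model_c), ("ensemble", ensemble)]:
    print(f"{name}: F1 = {f1_score(y_true, preds):.2f}")
```

In this toy setup each base model makes different mistakes, so the majority vote corrects them. Real classifiers often share errors, in which case voting helps less.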