Classification of Twitter Vaping Discourse Using BERTweet: Comparative Deep Learning Study.

Author information

Baker William, Colditz Jason B, Dobbs Page D, Mai Huy, Visweswaran Shyam, Zhan Justin, Primack Brian A

Affiliations

Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR, United States.

Division of General Internal Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States.

Publication information

JMIR Med Inform. 2022 Jul 21;10(7):e33678. doi: 10.2196/33678.

DOI:10.2196/33678

PMID:35862172

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9353682/

Abstract

BACKGROUND

Twitter provides a valuable platform for the surveillance and monitoring of public health topics; however, manually categorizing large quantities of Twitter data is labor intensive and presents barriers to identifying major trends and sentiments. Additionally, although machine learning and deep learning approaches with high accuracy have been proposed, they require large, annotated data sets. Public pretrained deep learning classification models, such as BERTweet, can produce higher-quality models while using smaller annotated training sets.

OBJECTIVE

This study aims to derive and evaluate a pretrained deep learning model based on BERTweet that can identify tweets relevant to vaping, tweets (related to vaping) of commercial nature, and tweets with provape sentiment. Additionally, the performance of the BERTweet classifier will be compared against a long short-term memory (LSTM) model to show the improvements a pretrained model has over traditional deep learning approaches.

METHODS

Twitter data were collected from August to October 2019 using vaping-related search terms. From this set, a random subsample of 2401 English tweets was manually annotated for relevance (vaping related or not), commercial nature (commercial or not), and sentiment (positive, negative, or neutral). Using the annotated data, 3 separate classifiers were built using BERTweet with the default parameters defined by the Simple Transformers application programming interface (API). Each model was trained for 20 iterations and evaluated with a random split of the annotated tweets, reserving 10% (n=165) of tweets for evaluation.
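The training setup described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes that "20 iterations" means 20 training epochs, that the public `vinai/bertweet-base` checkpoint is the starting point, and that labels are encoded as integers. The helper names `split_train_eval` and `train_bertweet_classifier`, the seed, and the CPU-only setting are hypothetical choices made for the example.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, int]  # (tweet text, integer label); the encoding is an assumption


def split_train_eval(
    examples: Sequence[Example], eval_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle the annotated examples and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def train_bertweet_classifier(train: Sequence[Example], num_labels: int, epochs: int = 20):
    """Fine-tune BERTweet on (text, label) pairs with Simple Transformers.

    Imports are local so that the split helper above can be used without
    the heavy dependencies (pandas, torch, simpletransformers) installed.
    """
    import pandas as pd
    from simpletransformers.classification import ClassificationArgs, ClassificationModel

    # Only the epoch count and output overwriting are set explicitly;
    # all other settings stay at the library defaults.
    args = ClassificationArgs(num_train_epochs=epochs, overwrite_output_dir=True)
    model = ClassificationModel(
        "bertweet",
        "vinai/bertweet-base",  # assumed checkpoint, not confirmed by the abstract
        num_labels=num_labels,
        args=args,
        use_cuda=False,  # CPU-only for the sketch; set to True when a GPU is available
    )
    train_df = pd.DataFrame(list(train), columns=["text", "labels"])
    model.train_model(train_df)
    return model
```

Under these assumptions, each of the 3 tasks would get its own call to `train_bertweet_classifier`, with `num_labels=2` for relevance and commercial nature and `num_labels=3` for sentiment.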

RESULTS

The relevance, commercial, and sentiment classifiers achieved an area under the receiver operating characteristic curve (AUROC) of 94.5%, 99.3%, and 81.7%, respectively. Additionally, the weighted F1 scores of each were 97.6%, 99.0%, and 86.1%, respectively. We found that BERTweet outperformed the LSTM model in the classification of all categories.
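For readers unfamiliar with the two reported metrics, the following dependency-free sketch shows how a weighted F1 score and a binary AUROC can be computed. It is illustrative only: the abstract does not say how the metrics were implemented or how AUROC was averaged for the 3-class sentiment task, so only the binary case is shown, and the function names are hypothetical. In practice, scikit-learn's `f1_score(..., average="weighted")` and `roc_auc_score` are the usual choices.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_true
    return total / len(y_true)


def binary_auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUROC above is quadratic in the number of examples, which is fine for a held-out set of a few hundred tweets but not for large corpora.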

CONCLUSIONS

Large, open-source deep learning classifiers, such as BERTweet, can enable researchers to reliably determine whether tweets are relevant to vaping; include commercial content; and include positive, negative, or neutral content about vaping, with higher accuracy than traditional natural language processing deep learning models. Such enhancement to the use of Twitter data can allow for faster exploration and dissemination of time-sensitive data than traditional methodologies (eg, surveys, polling research).

Similar articles

Machine Learning Classifiers for Twitter Surveillance of Vaping: Comparative Machine Learning Study.

J Med Internet Res. 2020 Aug 12;22(8):e17478. doi: 10.2196/17478.

Identifying Potential Lyme Disease Cases Using Self-Reported Worldwide Tweets: Deep Learning Modeling Approach Enhanced With Sentimental Words Through Emojis.

J Med Internet Res. 2023 Oct 16;25:e47014. doi: 10.2196/47014.

Comparison of pretrained transformer-based models for influenza and COVID-19 detection using social media text data in Saskatchewan, Canada.

Front Digit Health. 2023 Jun 28;5:1203874. doi: 10.3389/fdgth.2023.1203874. eCollection 2023.

Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning.

J Med Internet Res. 2015 Aug 25;17(8):e208. doi: 10.2196/jmir.4392.

"When 'Bad' is 'Good'": Identifying Personal Communication and Sentiment in Drug-Related Tweets.

JMIR Public Health Surveill. 2016 Oct 24;2(2):e162. doi: 10.2196/publichealth.6327.

Exploring Eating Disorder Topics on Twitter: Machine Learning Approach.

JMIR Med Inform. 2020 Oct 30;8(10):e18273. doi: 10.2196/18273.

Automated Detection of Vaping-Related Tweets on Twitter During the 2019 EVALI Outbreak Using Machine Learning Classification.

Front Big Data. 2022 Feb 10;5:770585. doi: 10.3389/fdata.2022.770585. eCollection 2022.

Identifying Key Topics Bearing Negative Sentiment on Twitter: Insights Concerning the 2015-2016 Zika Epidemic.

JMIR Public Health Surveill. 2019 Jun 4;5(2):e11036. doi: 10.2196/11036.

Using #ActuallyAutistic on Twitter for Precision Diagnosis of Autism Spectrum Disorder: Machine Learning Study.

JMIR Form Res. 2024 Feb 14;8:e52660. doi: 10.2196/52660.

Cited by

Public perception and changing attitudes toward antidepressants over a decade in social media: Lessons learned from online discussion using artificial intelligence.

PLoS One. 2025 Sep 4;20(9):e0318464. doi: 10.1371/journal.pone.0318464. eCollection 2025.

Assessment of beliefs and attitudes towards benzodiazepines using machine learning based on social media posts: an observational study.

BMC Psychiatry. 2024 Oct 8;24(1):659. doi: 10.1186/s12888-024-06111-5.

Twitter Sentiment About the US Federal Tobacco 21 Law: Mixed Methods Analysis.

JMIR Form Res. 2023 Aug 31;7:e50346. doi: 10.2196/50346.

Comparison of pretrained transformer-based models for influenza and COVID-19 detection using social media text data in Saskatchewan, Canada.

Front Digit Health. 2023 Jun 28;5:1203874. doi: 10.3389/fdgth.2023.1203874. eCollection 2023.

Artificial Intelligence-Enabled Analysis of Statin-Related Topics and Sentiments on Social Media.

JAMA Netw Open. 2023 Apr 3;6(4):e239747. doi: 10.1001/jamanetworkopen.2023.9747.
