Baker William, Colditz Jason B, Dobbs Page D, Mai Huy, Visweswaran Shyam, Zhan Justin, Primack Brian A
Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR, United States.
Division of General Internal Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States.
JMIR Med Inform. 2022 Jul 21;10(7):e33678. doi: 10.2196/33678.
Twitter provides a valuable platform for the surveillance and monitoring of public health topics; however, manually categorizing large quantities of Twitter data is labor intensive and presents barriers to identifying major trends and sentiments. Additionally, although machine and deep learning approaches have been proposed with high accuracy, they require large annotated data sets. Public pretrained deep learning classification models, such as BERTweet, can produce higher-quality models while using smaller annotated training sets.
This study aims to derive and evaluate a pretrained deep learning model based on BERTweet that can identify tweets relevant to vaping, vaping-related tweets of a commercial nature, and tweets with provape sentiment. Additionally, the performance of the BERTweet classifier is compared against a long short-term memory (LSTM) model to show the improvement a pretrained model offers over traditional deep learning approaches.
Twitter data were collected from August to October 2019 using vaping-related search terms. From this set, a random subsample of 2401 English tweets was manually annotated for relevance (vaping related or not), commercial nature (commercial or not), and sentiment (positive, negative, or neutral). From the annotated data, 3 separate classifiers were built using BERTweet with the default parameters defined by the Simple Transformers application programming interface (API). Each model was trained for 20 iterations and evaluated on a random split of the annotated tweets, reserving 10% (n=165) of tweets for evaluation.
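The random split and training setup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `split_annotated` and `train_classifier`, the fixed seed, the rounding rule, and the reading of "20 iterations" as 20 training epochs are all assumptions, as is the use of the `vinai/bertweet-base` checkpoint through the Simple Transformers `ClassificationModel` class.

```python
import random


def split_annotated(rows, eval_fraction=0.10, seed=0):
    """Randomly split annotated rows into (train, eval) lists.

    `rows` is a list of (text, label) pairs. The seed and rounding rule
    are assumptions; the source only states that the split was random
    and that 10% of tweets were reserved for evaluation.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def train_classifier(train_rows, eval_rows, num_labels):
    """Hypothetical helper: fine-tune BERTweet on one annotation task.

    Imports are deferred so the split above can be used without the
    heavier dependencies installed.
    """
    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    # Simple Transformers expects a "text" column and a "labels" column.
    train_df = pd.DataFrame(train_rows, columns=["text", "labels"])
    eval_df = pd.DataFrame(eval_rows, columns=["text", "labels"])

    model = ClassificationModel(
        "bertweet",
        "vinai/bertweet-base",
        num_labels=num_labels,
        # Assumption: "20 iterations" is interpreted as 20 epochs;
        # all other settings are left at the library defaults.
        args={"num_train_epochs": 20},
        use_cuda=False,
    )
    model.train_model(train_df)
    result, model_outputs, wrong_predictions = model.eval_model(eval_df)
    return model, result
```

Under these assumptions, one classifier would be trained per task: `num_labels=2` for relevance and for commercial nature, and `num_labels=3` for sentiment.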
The relevance, commercial, and sentiment classifiers achieved an area under the receiver operating characteristic curve (AUROC) of 94.5%, 99.3%, and 81.7%, respectively. The corresponding weighted F1 scores were 97.6%, 99.0%, and 86.1%. BERTweet outperformed the LSTM model in the classification of all categories.
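For readers unfamiliar with the two metrics reported above, the following sketch shows how a weighted F1 score and a binary AUROC can be computed from labels and predictions. It is an illustration only: the function names are hypothetical, the toy data in the usage note are not from the study, and the source does not state how AUROC was computed for the 3-class sentiment task, so only the binary case is shown.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


def auroc(y_true, scores):
    """Binary AUROC as the probability that a randomly chosen positive
    example receives a higher score than a randomly chosen negative one
    (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

As a usage example, `weighted_f1([1, 1, 0, 0], [1, 0, 0, 0])` averages a per-class F1 of 2/3 (class 1) and 4/5 (class 0) with equal support, and `auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])` counts 3 correctly ordered positive-negative pairs out of 4.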
Large, open-source deep learning classifiers, such as BERTweet, can enable researchers to reliably determine whether tweets are relevant to vaping; include commercial content; and include positive, negative, or neutral content about vaping, with higher accuracy than traditional natural language processing deep learning models. Such enhanced use of Twitter data can allow for faster exploration and dissemination of time-sensitive data than traditional methodologies (eg, surveys, polling research).