Evaluating automatic annotation of lexicon-based models for stance detection of M-pox tweets from May 1st to Sep 5th, 2022.

Authors

Perikli Nicholas, Bhattacharya Srimoy, Ogbuokiri Blessing, Movahedi Nia Zahra, Lieberman Benjamin, Tripathi Nidhi, Dahbi Salah-Eddine, Stevenson Finn, Bragazzi Nicola, Kong Jude, Mellado Bruce

Affiliations

School of Physics and Institute for Collider Particle Physics, University of the Witwatersrand, Johannesburg, South Africa.

iThemba LABS, National Research Foundation, Cape Town, South Africa.

Publication information

PLOS Digit Health. 2024 Jul 30;3(7):e0000545. doi: 10.1371/journal.pdig.0000545. eCollection 2024 Jul.

DOI:10.1371/journal.pdig.0000545

PMID:39078813

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11288444/

Abstract

Manually labeling data for supervised learning is time-consuming and labor-intensive; therefore, lexicon-based models such as VADER and TextBlob are used to label data automatically. However, it has been argued that automated labels lack the accuracy required for training an efficient model. Although automated labeling is frequently used for stance detection, automated stance labels have not been properly evaluated in previous work. In this work, to assess the accuracy of VADER and TextBlob automated labels for stance analysis, we first manually label a Twitter (now X) dataset related to M-pox stance detection. We then fine-tune different transformer-based models on the hand-labeled M-pox dataset and compare their accuracy, before and after fine-tuning, with the accuracy of the automatically labeled data. Our results indicate that the fine-tuned models surpassed the accuracy of VADER and TextBlob automated labels by up to 38% and 72.5%, respectively. Topic modeling further shows that fine-tuning narrowed the scope of misclassified tweets to specific sub-topics. We conclude that fine-tuning transformer models on hand-labeled data for stance detection raises accuracy to a level significantly higher than that of automated stance detection labels. This study verifies that automated stance detection labels are not reliable for sensitive use cases such as health-related purposes. Manually labeled data is more convenient for developing Natural Language Processing (NLP) models that study and analyze mass opinions and conversations on social media platforms during crises such as pandemics and epidemics.
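The comparison the abstract describes, scoring automated labels against hand labels, can be illustrated with a minimal sketch. It is not the authors' pipeline. The label names, the example scores, and the helper names `score_to_label` and `accuracy` are assumptions made for illustration; the ±0.05 cutoff is the conventional threshold for VADER's compound score, and the abstract does not say which thresholds the study used.

```python
from typing import Sequence


def score_to_label(score: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a coarse label.

    The default threshold follows the conventional VADER compound-score
    cutoff; whether the study used it is an assumption.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of predicted labels that match the reference labels."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label sequences must be non-empty and equal in length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical polarity scores, standing in for the output of a lexicon-based
# model (for example VADER's compound score or TextBlob's polarity).
example_scores = [0.62, -0.41, 0.01, 0.30]

# Hypothetical hand labels for the same four tweets.
manual_labels = ["positive", "negative", "neutral", "negative"]

auto_labels = [score_to_label(s) for s in example_scores]
auto_accuracy = accuracy(auto_labels, manual_labels)
print(auto_labels, auto_accuracy)
```

The fourth example shows the failure mode at issue: a tweet can be worded positively while expressing a negative stance, so a polarity lexicon mislabels it even though the sentiment score itself is reasonable.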

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d661/11288444/b0f7263bf131/pdig.0000545.g001.jpg
