

Development of a COVID-19-Related Anti-Asian Tweet Data Set: Quantitative Study.

Authors

Mokhberi Maryam, Biswas Ahana, Masud Zarif, Kteily-Hawa Roula, Goldstein Abby, Gillis Joseph Roy, Rayana Shebuti, Ahmed Syed Ishtiaque

Affiliations

Department of Computer Science, University of Toronto, Toronto, ON, Canada.

Indian Institute of Technology Kanpur, Kanpur, India.

Publication

JMIR Form Res. 2023 Feb 28;7:e40403. doi: 10.2196/40403.

Abstract

BACKGROUND

Since the advent of the COVID-19 pandemic, individuals of Asian descent (colloquial usage prevalent in North America, where "Asian" is used to refer to people from East Asia, particularly China) have been the subject of stigma and hate speech in both offline and online communities. One of the major venues for encountering such unfair attacks is social networks, such as Twitter. As the research community seeks to understand, analyze, and implement detection techniques, high-quality data sets are becoming immensely important.

OBJECTIVE

In this study, we introduce a manually labeled data set of tweets containing anti-Asian stigmatizing content.

METHODS

We sampled over 668 million tweets posted on Twitter from January to July 2020 and used an iterative data construction approach that included 3 different stages of algorithm-driven data selection. Finally, we recruited volunteers who manually annotated the tweets to arrive at a high-quality data set of tweets and a second, subsampled data set with higher-quality labels from multiple annotators. We presented this final high-quality Twitter data set on stigma toward Chinese people during the COVID-19 pandemic. The data set and labeling instructions can be viewed in the GitHub repository. Furthermore, we implemented several state-of-the-art models to detect stigmatizing tweets and set initial benchmarks for our data set.
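The abstract does not detail the algorithm-driven selection stages. As a minimal sketch under stated assumptions, a plausible first-pass filter might keep only tweets that mention both a COVID-19 term and an Asia-related term; the keyword lists below are purely illustrative, not the paper's actual criteria:

```python
# Hypothetical first-pass filter: retain tweets mentioning both a COVID-19
# term and an Asia-related term. Keyword lists are illustrative only; the
# paper's actual selection algorithm is not described in the abstract.
COVID_TERMS = {"covid", "coronavirus", "pandemic"}
ASIA_TERMS = {"china", "chinese", "asian"}

def is_candidate(tweet: str) -> bool:
    """Return True if the tweet matches at least one term from each list."""
    text = tweet.lower()
    return any(t in text for t in COVID_TERMS) and any(t in text for t in ASIA_TERMS)

tweets = [
    "Coronavirus cases are rising in Italy",
    "Blaming Chinese people for covid is wrong",
]
candidates = [t for t in tweets if is_candidate(t)]
```

A real pipeline would iterate: filter, annotate a sample, then refine the selection model on the labeled examples, matching the 3-stage iterative approach the authors describe.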

RESULTS

Our primary contributions are labeled data sets. Data Set v3.0 contained 11,263 tweets with primary labels (unknown/irrelevant, not-stigmatizing, stigmatizing-low, stigmatizing-medium, stigmatizing-high) and tweet subtopics (eg, wet market and eating habits, COVID-19 cases, bioweapon). Data Set v3.1 contained 4998 (44.4%) tweets randomly sampled from Data Set v3.0, where a second annotator labeled them only on the primary labels and then a third annotator resolved conflicts between the first and second annotators. To demonstrate the usefulness of our data set, preliminary experiments showed that the Bidirectional Encoder Representations from Transformers (BERT) model achieved the highest accuracy, 79%, when detecting stigma on unseen data, while a traditional model, the support vector machine (SVM), performed at 73% accuracy.
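The two-annotator scheme with a third tie-breaker used for Data Set v3.1 can be sketched as follows; the function name is illustrative, and the paper's exact adjudication procedure may differ:

```python
# Sketch of the v3.1 labeling scheme: a second annotator re-labels each tweet
# on the primary labels, and a third annotator resolves any disagreement
# between the first two.
def resolve_label(first: str, second: str, tiebreak: str) -> str:
    """Return the agreed label, deferring to the third annotator on conflict."""
    return first if first == second else tiebreak

# Example: annotators 1 and 2 disagree, so annotator 3's label is final.
final = resolve_label("stigmatizing-low", "not-stigmatizing", "stigmatizing-low")
```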

CONCLUSIONS

Our data set can be used as a benchmark for further qualitative and quantitative research and analysis around the issue. It first reaffirms the existence and significance of widespread discrimination and stigma toward the Asian population worldwide. Moreover, our data set and subsequent arguments should assist other researchers from various domains, including psychologists, public policy authorities, and sociologists, to analyze the complex economic, political, historical, and cultural underlying roots of anti-Asian stigmatization and hateful behaviors. A manually annotated data set is of paramount importance for developing algorithms that can be used to detect stigma or problematic text, particularly on social media. We believe this contribution will help predict and subsequently design interventions that will significantly help reduce stigma, hate, and discrimination against marginalized populations during future crises like COVID-19.
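As a concrete illustration of the kind of detector such a labeled data set enables, a TF-IDF plus linear SVM baseline, one of the traditional models benchmarked above, might be sketched as follows, assuming scikit-learn is available. The texts and binary labels here are toy placeholders, not examples from the actual data set:

```python
# Minimal sketch of an SVM stigma-detection baseline, assuming scikit-learn.
# Toy texts and labels stand in for the real labeled tweets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "they brought the virus here",       # placeholder stigmatizing example
    "stay safe and wash your hands",     # placeholder neutral example
    "blame them for the outbreak",       # placeholder stigmatizing example
    "vaccine trials show good progress", # placeholder neutral example
]
labels = ["stigmatizing", "not-stigmatizing", "stigmatizing", "not-stigmatizing"]

# TF-IDF features feed a linear SVM, mirroring the traditional baseline above.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
pred = model.predict(["stay safe everyone"])[0]
```

In practice the full five-level label scheme could be kept for ordinal modeling, or collapsed to binary stigmatizing versus not-stigmatizing as in this sketch.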


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/749e/9976773/174e1ab6c470/formative_v7i1e40403_fig1.jpg
