Toliyat Amir, Levitan Sarah Ita, Peng Zheng, Etemadpour Ronak
Computer Science Program, Graduate Center, City University of New York, New York, NY, United States.
Computer Science Program, Hunter College, City University of New York, New York, NY, United States.
Front Artif Intell. 2022 Aug 15;5:932381. doi: 10.3389/frai.2022.932381. eCollection 2022.
Coronavirus disease 2019 (COVID-19) emerged in Wuhan, China, in late 2019 and, after spreading quickly through Asian countries, rapidly reached the rest of the world. The disease led governments worldwide to declare a public health crisis and to take severe measures to slow its spread. The pandemic affected the lives of millions of people. Many citizens who lost loved ones and jobs experienced a wide range of emotions, such as disbelief, shock, concerns about health, fear about food supplies, anxiety, and panic. These phenomena contributed to the spread of racism and hate against Asians in Western countries, especially in the United States. An analysis of official preliminary police data by the Center for the Study of Hate & Extremism at California State University shows that anti-Asian hate crime in 16 of America's largest cities increased by 149% in 2020. In this study, we first chose a baseline of Americans' hate crimes against Asians on Twitter. We then present an approach to balance the biased dataset and consequently improve the performance of tweet classification. We also downloaded 10 million tweets through the Twitter API v2; this study uses a small portion of them, and we will use the entire dataset in a future study. In this article, three thousand tweets from our collected corpus were annotated by four annotators, three Asian and one Asian-American. Using this data, we built predictive models of hate speech using various machine learning and deep learning methods. Our machine learning methods include Random Forest, K-nearest neighbors (KNN), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Logistic Regression, Decision Tree, and Naive Bayes. Our deep learning models include a basic Long Short-Term Memory (LSTM) network, Bidirectional LSTM, Bidirectional LSTM with dropout, a convolutional model, and Bidirectional Encoder Representations from Transformers (BERT).
We also adjusted our dataset by filtering out tweets that were ambiguous to the annotators, as indicated by low Fleiss' kappa agreement between annotators. Our final results showed that Logistic Regression achieved the best performance among the statistical machine learning models, with an F1 score of 0.72, while BERT achieved the best performance among the deep learning models, with an F1 score of 0.85.
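The agreement-based filtering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the label names (`"hate"`, `"none"`), the toy ratings, and the 0.5 threshold are assumptions chosen for illustration, and the abstract does not state the exact filtering rule. The sketch computes Fleiss' kappa over all items and uses the per-item agreement term of that statistic (the fraction of annotator pairs that agree on a tweet) to decide which tweets to keep.

```python
from collections import Counter


def item_agreement(labels):
    """Per-item agreement P_i: fraction of annotator pairs that gave the same label."""
    n = len(labels)
    counts = Counter(labels)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def fleiss_kappa(ratings):
    """Fleiss' kappa for items that were each labeled by the same number of annotators."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number of ratings")

    # Mean observed agreement across items.
    p_bar = sum(item_agreement(r) for r in ratings) / n_items

    # Chance agreement from the overall label proportions.
    totals = Counter(label for r in ratings for label in r)
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())

    if p_e == 1.0:
        # Only one label was ever used; kappa is undefined (0/0).
        # Returning 1.0 here is a convention of this sketch, not a standard.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


def filter_ambiguous(ratings, min_agreement):
    """Return indices of items whose per-item agreement meets the threshold."""
    return [i for i, r in enumerate(ratings) if item_agreement(r) >= min_agreement]


# Hypothetical toy data: four tweets, each labeled by four annotators.
ratings = [
    ["hate", "hate", "hate", "hate"],  # unanimous
    ["none", "none", "none", "none"],  # unanimous
    ["hate", "hate", "none", "none"],  # evenly split, most ambiguous
    ["hate", "none", "none", "none"],  # three against one
]

kappa = fleiss_kappa(ratings)
kept = filter_ambiguous(ratings, min_agreement=0.5)  # threshold is an assumption
```

With these toy ratings, the evenly split tweet is the only one dropped. In practice the threshold would have to be chosen against the actual annotation data, since it trades dataset size against label reliability.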