Wich Maximilian, Eder Tobias, Al Kuwatly Hala, Groh Georg
Technical University of Munich, Munich, Germany.
AI Ethics. 2022;2(1):79-101. doi: 10.1007/s43681-021-00081-0. Epub 2021 Jul 19.
Recently, numerous datasets have been produced as research activity in the field of automatic detection of abusive language and hate speech has increased. A problem with this diversity is that the datasets often differ, among other things, in context, platform, sampling process, collection strategy, and labeling schema. Surveys of these datasets exist, but they compare the datasets only superficially. We therefore developed a bias and comparison framework for abusive language datasets that enables in-depth analysis, and we use it to compare five English and six Arabic datasets. We make this framework available to researchers and data scientists who work with such datasets so that they are aware of the datasets' properties and can take them into account in their work.