Hate speech detection with ADHAR: a multi-dialectal hate speech corpus in Arabic.

Authors

Charfi Anis, Besghaier Mabrouka, Akasheh Raghda, Atalla Andria, Zaghouani Wajdi

Affiliations

Information Systems Department, Carnegie Mellon University, Doha, Qatar.

College of Humanities and Social Sciences, Hamad Bin Khalifa University, Doha, Qatar.

Publication information

Front Artif Intell. 2024 May 30;7:1391472. doi: 10.3389/frai.2024.1391472. eCollection 2024.

DOI:10.3389/frai.2024.1391472
PMID:38873176
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11170444/
Abstract

Hate speech detection in Arabic poses a complex challenge due to the dialectal diversity across the Arab world. Most existing hate speech datasets for Arabic cover only one dialect or one hate speech category. They also lack balance across dialects, topics, and hate/non-hate classes. In this paper, we address this gap by presenting ADHAR, a comprehensive multi-dialect, multi-category hate speech corpus for Arabic. ADHAR contains 70,369 words and spans five language variants: Modern Standard Arabic (MSA), Egyptian, Levantine, Gulf, and Maghrebi. It covers four key hate speech categories: nationality, religion, ethnicity, and race. A major contribution is that ADHAR is carefully curated to maintain balance across dialects, categories, and hate/non-hate classes to enable unbiased dataset evaluation. We describe the systematic data collection methodology, followed by a rigorous annotation process involving multiple annotators per dialect. Extensive qualitative and quantitative analyses demonstrate the quality and usefulness of ADHAR. Our experiments with various classical and deep learning models demonstrate that our dataset enables the development of robust hate speech classifiers for Arabic, achieving accuracy and F1-scores of up to 90% for hate speech detection and up to 92% for category detection. Using AraBERT, we achieved an accuracy and F1-score of 94% for hate speech detection and 95% for category detection.
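The balance claim in the abstract can be illustrated with a small check. The following is only a minimal sketch, not the authors' procedure: the toy records, the `(dialect, category, label)` layout, and the `balance_ratio` helper are all hypothetical and are not taken from ADHAR.

```python
from collections import Counter

# Hypothetical toy records in the form (dialect, category, label).
# These are illustrative only and are NOT drawn from the ADHAR corpus.
records = [
    ("MSA", "religion", "hate"),
    ("MSA", "religion", "non-hate"),
    ("Egyptian", "nationality", "hate"),
    ("Egyptian", "nationality", "non-hate"),
    ("Gulf", "race", "hate"),
    ("Gulf", "race", "non-hate"),
]


def balance_ratio(values):
    """Return smallest group count / largest group count.

    A value of 1.0 means every group has the same number of items;
    values near 0 mean at least one group is heavily under-represented.
    """
    counts = Counter(values)
    return min(counts.values()) / max(counts.values())


# Check balance separately along each axis named in the abstract.
dialect_balance = balance_ratio(r[0] for r in records)
category_balance = balance_ratio(r[1] for r in records)
label_balance = balance_ratio(r[2] for r in records)

print(dialect_balance, category_balance, label_balance)
```

A ratio close to 1.0 on all three axes is one simple way to express the kind of balance the abstract describes; how the authors actually measured or enforced balance is not stated here.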

Similar articles

1
Hate speech detection with ADHAR: a multi-dialectal hate speech corpus in Arabic.
Front Artif Intell. 2024 May 30;7:1391472. doi: 10.3389/frai.2024.1391472. eCollection 2024.
2
Hate speech detection in the Arabic language: corpus design, construction, and evaluation.
Front Artif Intell. 2024 Feb 20;7:1345445. doi: 10.3389/frai.2024.1345445. eCollection 2024.
3
Code-mixing unveiled: Enhancing the hate speech detection in Arabic dialect tweets using machine learning models.
PLoS One. 2024 Jul 17;19(7):e0305657. doi: 10.1371/journal.pone.0305657. eCollection 2024.
4
A systematic literature review of hate speech identification on Arabic Twitter data: research challenges and future directions.
PeerJ Comput Sci. 2024 Apr 2;10:e1966. doi: 10.7717/peerj-cs.1966. eCollection 2024.
5
IADD: An integrated Arabic dialect identification dataset.
Data Brief. 2021 Dec 30;40:107777. doi: 10.1016/j.dib.2021.107777. eCollection 2022 Feb.
6
The design, construction and evaluation of annotated Arabic cyberbullying corpus.
Educ Inf Technol (Dordr). 2022;27(8):10977-11023. doi: 10.1007/s10639-022-11056-x. Epub 2022 Apr 28.
7
Detection of Hate Speech in COVID-19-Related Tweets in the Arab Region: Deep Learning and Topic Modeling Approach.
J Med Internet Res. 2020 Dec 8;22(12):e22609. doi: 10.2196/22609.
8
Detection of cyberhate speech towards female sport in the Arabic Xsphere.
PeerJ Comput Sci. 2024 Jun 27;10:e2138. doi: 10.7717/peerj-cs.2138. eCollection 2024.
9
Emotionally Informed Hate Speech Detection: A Multi-target Perspective.
Cognit Comput. 2022;14(1):322-352. doi: 10.1007/s12559-021-09862-5. Epub 2021 Jun 28.
10
A Transformer-Based Neural Machine Translation Model for Arabic Dialects That Utilizes Subword Units.
Sensors (Basel). 2021 Sep 29;21(19):6509. doi: 10.3390/s21196509.