Code-mixing unveiled: Enhancing the hate speech detection in Arabic dialect tweets using machine learning models.

Authors

Alhazmi Ali, Mahmud Rohana, Idris Norisma, Mohamed Abo Mohamed Elhag, Eke Christopher Ifeanyi

Affiliations

Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia.

Department of Computer Science, College of Engineering and Computer Science, Jazan University, Jazan, Saudi Arabia.

Publication information

PLoS One. 2024 Jul 17;19(7):e0305657. doi: 10.1371/journal.pone.0305657. eCollection 2024.

DOI:10.1371/journal.pone.0305657
PMID:39018339
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11253949/
Abstract

Technological developments over the past few decades have changed the way people communicate, with platforms like social media and blogs becoming vital channels for international conversation. Even though hate speech is vigorously suppressed on social media, it is still a concern that needs to be constantly recognized and observed. The Arabic language poses particular difficulties in the detection of hate speech, despite the considerable efforts made in this area for English-language social media content. Arabic calls for particular consideration when it comes to hate speech detection because of its many dialects and linguistic nuances. Another degree of complication is added by the widespread practice of "code-mixing," in which users merge various languages smoothly. Recognizing this research vacuum, the study aims to close it by examining how well machine learning models containing variation features can detect hate speech, especially when it comes to Arabic tweets featuring code-mixing. Therefore, the objective of this study is to assess and compare the effectiveness of different features and machine learning models for hate speech detection on Arabic hate speech and code-mixing hate speech datasets. To achieve the objectives, the methodology used includes data collection, data pre-processing, feature extraction, the construction of classification models, and the evaluation of the constructed classification models. The findings from the analysis revealed that the TF-IDF feature, when employed with the SGD model, attained the highest accuracy, reaching 98.21%. Subsequently, these results were contrasted with outcomes from three existing studies, and the proposed method outperformed them, underscoring the significance of the proposed method. Consequently, our study carries practical implications and serves as a foundational exploration in the realm of automated hate speech detection in text.
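
The abstract reports that TF-IDF features combined with an SGD-trained model gave the best accuracy. As a rough illustration of that combination only, the sketch below builds smoothed, L2-normalised TF-IDF vectors and fits a linear classifier by stochastic gradient descent using only the Python standard library. Everything specific here is an assumption rather than the authors' setup: the toy code-mixed texts and labels are invented, the logistic loss, learning rate, and epoch count are arbitrary choices, and the helper names (tokenize, fit_idf, tfidf, train_sgd, predict) are hypothetical.

```python
import math
import random
from collections import Counter

# Invented toy data (NOT from the paper): romanised Arabic mixed with English.
texts = [
    "sabah alkhair good morning everyone",
    "shukran thanks for the help",
    "mabrook congrats on the new job",
    "you are ghabi and stupid",
    "ghabi people like you are trash",
    "stupid trash go away",
]
labels = [0, 0, 0, 1, 1, 1]  # 1 = abusive, 0 = not abusive (toy labels)


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf(text, idf):
    """Sparse, L2-normalised TF-IDF vector; tokens unseen in training are dropped."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def train_sgd(vectors, targets, epochs=200, lr=0.5, seed=0):
    """Fit a linear model with logistic loss, one example per update (plain SGD)."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = vectors[i], targets[i]
            z = bias + sum(weights.get(t, 0.0) * v for t, v in x.items())
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            g = p - y                       # gradient of the log loss w.r.t. z
            for t, v in x.items():
                weights[t] = weights.get(t, 0.0) - lr * g * v
            bias -= lr * g
    return weights, bias


def predict(text, idf, weights, bias):
    """Return 1 if the linear score is positive, else 0."""
    x = tfidf(text, idf)
    z = bias + sum(weights.get(t, 0.0) * v for t, v in x.items())
    return 1 if z > 0 else 0


idf = fit_idf(texts)
weights, bias = train_sgd([tfidf(t, idf) for t in texts], labels)
print([predict(t, idf, weights, bias) for t in texts])
```

A real evaluation would hold out test data and apply Arabic-specific pre-processing; the sketch only shows how the feature extraction and the SGD update fit together.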

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a03/11253949/f29d859b6d9f/pone.0305657.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a03/11253949/d00d8c488bbe/pone.0305657.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a03/11253949/5080e422c78b/pone.0305657.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a03/11253949/af6312c076df/pone.0305657.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a03/11253949/4c5692c0fbdf/pone.0305657.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a03/11253949/824f1c27bcf8/pone.0305657.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a03/11253949/fec83bbecbfb/pone.0305657.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a03/11253949/cd4853113995/pone.0305657.g008.jpg

Cited by

1
GATmath and GATLc: Comprehensive benchmarks for evaluating Arabic large language models.
PLoS One. 2025 Sep 2;20(9):e0329129. doi: 10.1371/journal.pone.0329129. eCollection 2025.
2
Correction: Code-mixing unveiled: Enhancing the hate speech detection in Arabic dialect tweets using machine learning models.
PLoS One. 2025 Aug 13;20(8):e0330305. doi: 10.1371/journal.pone.0330305. eCollection 2025.
