Alhazmi Ali, Mahmud Rohana, Idris Norisma, Mohamed Abo Mohamed Elhag, Eke Christopher Ifeanyi
Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia.
Department of Computer Science, College of Engineering and Computer Science, Jazan University, Jazan, Saudi Arabia.
PLoS One. 2024 Jul 17;19(7):e0305657. doi: 10.1371/journal.pone.0305657. eCollection 2024.
Technological developments over the past few decades have changed how people communicate, and platforms such as social media and blogs have become vital channels for international conversation. Although social media platforms actively suppress hate speech, it remains a concern that requires continuous detection and monitoring. Considerable effort has gone into hate speech detection for English-language social media content, but Arabic poses particular difficulties because of its many dialects and linguistic nuances. The widespread practice of "code-mixing," in which users blend multiple languages within the same text, adds a further layer of complexity. To address this research gap, the study examines how well machine learning models with a variety of features can detect hate speech, particularly in Arabic tweets that feature code-mixing. The objective is to assess and compare the effectiveness of different features and machine learning models on Arabic hate speech and code-mixed hate speech datasets. The methodology comprises data collection, data pre-processing, feature extraction, construction of classification models, and evaluation of those models. The analysis showed that the TF-IDF (term frequency-inverse document frequency) feature, combined with the SGD (stochastic gradient descent) model, attained the highest accuracy, 98.21%. These results were then compared with those of three existing studies, and the proposed method outperformed all three. The study therefore has practical implications and serves as a foundational exploration of automated hate speech detection in text.
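To make the best-performing combination more concrete, the following is a minimal, standard-library-only sketch of a TF-IDF feature extractor feeding a linear classifier trained by stochastic gradient descent. It is not the authors' implementation. The abstract does not state the loss function, hyperparameters, tokenization, or tooling, so the hinge loss, the whitespace tokenizer, the learning rate, the absence of regularization, the toy code-mixed messages, and every function name below are assumptions made purely for illustration.

```python
import math
import random
from collections import Counter

def tokenize(text):
    # Assumption: plain whitespace tokenization; real Arabic pre-processing is richer.
    return text.lower().split()

def fit_idf(docs):
    # Smoothed inverse document frequency computed over the training documents.
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(doc, idf):
    # Sparse TF-IDF vector, L2-normalized; tokens unseen in training are ignored.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def score(x, w, b):
    return sum(w.get(t, 0.0) * v for t, v in x.items()) + b

def train_sgd(vectors, labels, epochs=100, lr=1.0, seed=0):
    # Linear model trained one example at a time on the hinge loss.
    # Labels are +1 (hateful) or -1 (not hateful). No regularization (assumption).
    rng = random.Random(seed)
    w, b = {}, 0.0
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = vectors[i], labels[i]
            if y * score(x, w, b) < 1:  # margin violated: take a gradient step
                for t, v in x.items():
                    w[t] = w.get(t, 0.0) + lr * y * v
                b += lr * y
    return w, b

def predict(doc, idf, w, b):
    return 1 if score(tfidf(doc, idf), w, b) >= 0 else -1

# Hypothetical toy data mixing English and Arabic tokens (not from the paper).
train_docs = [
    "you people are worthless قذر",
    "worthless trash اخرج من هنا",
    "people like them are trash",
    "have a lovely day يوم سعيد",
    "thank my friend شكرا",
    "lovely day with my friend",
]
labels = [1, 1, 1, -1, -1, -1]

idf = fit_idf(train_docs)
w, b = train_sgd([tfidf(d, idf) for d in train_docs], labels)
train_accuracy = sum(
    predict(d, idf, w, b) == y for d, y in zip(train_docs, labels)
) / len(labels)
print("training accuracy:", train_accuracy)
```

Accuracy on the training messages says nothing about generalization. A faithful evaluation would use held-out data and the pre-processing steps described in the full paper, and the 98.21% figure reported above refers to the authors' datasets, not to this toy example.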