Roman Urdu hate speech detection using hybrid machine learning models and hyperparameter optimization.

Affiliations

Department of Software Engineering, University of Management and Technology, Lahore, 54590, Pakistan.

Department of Computer Science, University of Management and Technology, Lahore, 54590, Pakistan.

Publication information

Sci Rep. 2024 Nov 19;14(1):28590. doi: 10.1038/s41598-024-79106-7.

DOI:10.1038/s41598-024-79106-7

PMID:39562608

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11576869/

Abstract

With the rapid growth in the number of social media users, cyberbullying and hate speech have become increasingly serious problems in recent years. Automatic hate speech detection (HSD) from text is an emerging research problem in natural language processing (NLP). Researchers have developed various approaches to automatic hate speech detection using corpora in many languages; however, research on the Urdu language is scarce. This study addresses the HSD task on Twitter using Roman Urdu text. The contribution of this research is the development of a hybrid model for Roman Urdu HSD, which has not been previously explored. The hybrid model uses deep learning (DL) and transformer models for automatic feature extraction and machine learning algorithms (MLAs) for classification. To further improve model performance, we employ several hyperparameter optimization (HPO) techniques: Grid Search (GS), Randomized Search (RS), and Bayesian Optimization with Gaussian Processes (BOGP). Evaluation is carried out on two publicly available benchmark Roman Urdu corpora: the HS-RU-20 corpus and the RUHSOLD hate speech corpus. Results show that the Multilingual BERT (MBERT) feature learner, paired with a Support Vector Machine (SVM) classifier and optimized using RS, achieves state-of-the-art performance. On the HS-RU-20 corpus, this model attained an accuracy of 0.93 and an F1 score of 0.95 on the Neutral-Hostile classification task, and an accuracy of 0.89 and an F1 score of 0.88 on the Hate Speech-Offensive task. On the RUHSOLD corpus, the same model achieved an accuracy of 0.95 and an F1 score of 0.94 on the Coarse-grained task, and an accuracy of 0.87 and an F1 score of 0.84 on the Fine-grained task. These results demonstrate the effectiveness of our hybrid approach for Roman Urdu hate speech detection.
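The classification stage described in the abstract (fixed feature vectors fed to an SVM whose hyperparameters are tuned by Randomized Search) might look roughly like the following minimal sketch. It is not the authors' implementation. It assumes scikit-learn and SciPy, and it uses synthetic vectors from `make_classification` as a stand-in for MBERT sentence embeddings so that it runs without downloading a model. The search ranges, the number of sampled configurations, and the choice of F1 as the selection metric are all illustrative assumptions, not values from the paper.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic 32-dimensional vectors stand in for MBERT sentence
# embeddings; in the hybrid setup these would come from the feature learner.
X, y = make_classification(
    n_samples=300, n_features=32, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Classical classifier applied on top of the (stand-in) learned features.
pipeline = make_pipeline(StandardScaler(), SVC())

# ASSUMPTION: illustrative search space, not the ranges used in the paper.
param_distributions = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-4, 1e0),
    "svc__kernel": ["linear", "rbf"],
}

# Randomized Search: sample a fixed number of configurations and keep the
# one with the best cross-validated F1 score.
search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=15,
    scoring="f1",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Report the two metrics quoted in the abstract on the held-out split.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("best params:", search.best_params_)
print(f"accuracy={accuracy:.2f}  f1={f1:.2f}")
```

Unlike Grid Search, which evaluates every combination on a fixed grid, Randomized Search samples a fixed budget of configurations (`n_iter`) from the given distributions, so its cost does not grow with the resolution of the search space.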
