Hashir Muhammad Haseeb, Kim Sung Won
Information and Communication Engineering, Yeungnam University, Gyeongsan, Gyeongbuk, Republic of Korea.
School of Computer Science and Engineering, Yeungnam University, Gyeongsan, Gyeongbuk, Republic of Korea.
PeerJ Comput Sci. 2025 May 30;11:e2911. doi: 10.7717/peerj-cs.2911. eCollection 2025.
The proliferation of user-generated content on social networking sites has intensified the challenge of accurately and efficiently detecting inflammatory and discriminatory speech at scale. Traditional manual moderation methods are impractical due to the sheer volume and complexity of online discourse, necessitating automated solutions. However, existing deep learning models for hate speech detection typically function as black-box systems, providing binary classifications without interpretable insights into their decision-making processes. This opacity significantly limits their practical utility, particularly in nuanced content moderation tasks. To address this challenge, our research explores leveraging the advanced reasoning and knowledge integration capabilities of state-of-the-art language models, specifically Mistral-7B, to develop transparent hate speech detection systems. We introduce a novel framework wherein large language models (LLMs) generate explicit rationales by identifying and analyzing critical textual features indicative of hate speech. These rationales are subsequently integrated into specialized classifiers designed to perform explainable content moderation. We rigorously evaluate our methodology on multiple benchmark English-language social media datasets. Results demonstrate that incorporating LLM-generated explanations significantly enhances both the interpretability and accuracy of hate speech detection. This approach not only identifies problematic content effectively but also clearly articulates the analytical rationale behind each decision, fulfilling the critical demand for transparency in automated content moderation.
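The abstract describes a two-stage pipeline: an LLM first produces an explicit rationale for a post, and that rationale is then combined with the post for an explainable classifier. The sketch below illustrates this idea only; the model checkpoints, prompt wording, the `[SEP]` concatenation scheme, and the classifier name `your-org/hate-speech-classifier` are assumptions for illustration, not the authors' reported implementation.

```python
# Hedged sketch of the rationale-then-classify pipeline described in the abstract.
# Checkpoints, prompt text, and the concatenation format are illustrative assumptions.
from transformers import pipeline

# Stage 1: rationale generation with an instruction-tuned Mistral-7B (assumed checkpoint).
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def generate_rationale(post: str) -> str:
    # Ask the LLM to point out textual features indicative of hate speech.
    prompt = (
        "Identify any words or phrases in the following post that may indicate "
        f"hate speech, and briefly explain why.\nPost: {post}\nRationale:"
    )
    out = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()

# Stage 2: a separately fine-tuned classifier scores the post together with its rationale
# (hypothetical checkpoint name).
classifier = pipeline("text-classification", model="your-org/hate-speech-classifier")

def moderate(post: str) -> dict:
    rationale = generate_rationale(post)
    verdict = classifier(f"{post} [SEP] {rationale}")[0]
    # Return both the decision and the rationale that supports it.
    return {"label": verdict["label"], "score": verdict["score"], "rationale": rationale}

print(moderate("example social media post"))
```

In this sketch the rationale serves double duty: it is surfaced to moderators as the explanation for the decision and also appended to the input so the classifier can condition on the flagged features.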