
Roman Urdu Hate Speech Detection Using Transformer-Based Model for Cyber Security Applications.

Affiliations

Department of Computer Science, Islamia College Peshawar, Peshawar 25130, Pakistan.

Malaysian Institute of Information Technology, Universiti Kuala Lumpur, Kuala Lumpur 50250, Malaysia.

Publication Information

Sensors (Basel). 2023 Apr 12;23(8):3909. doi: 10.3390/s23083909.

Abstract

Social media applications, such as Twitter and Facebook, allow users to communicate and share their thoughts, status updates, opinions, photographs, and videos around the globe. Unfortunately, some people use these platforms to disseminate hate speech and abusive language. The growth of hate speech may result in hate crimes, cyber violence, and substantial harm to cyberspace, physical security, and social safety. As a result, hate speech detection is a critical issue for both cyberspace and physical society, necessitating the development of a robust application capable of detecting and combating it in real time. Hate speech detection is a context-dependent problem that requires context-aware mechanisms for resolution. In this study, we employed a transformer-based model for Roman Urdu hate speech classification due to its ability to capture textual context. In addition, we developed the first Roman Urdu pre-trained BERT model, which we named BERT-RU. For this purpose, we exploited the capabilities of BERT by training it from scratch on the largest Roman Urdu dataset, consisting of 173,714 text messages. Traditional and deep learning models were used as baselines, including LSTM, BiLSTM, BiLSTM + Attention Layer, and CNN. We also investigated transfer learning by using pre-trained BERT embeddings in conjunction with deep learning models. The performance of each model was evaluated in terms of accuracy, precision, recall, and F-measure, and the generalization of each model was evaluated on a cross-domain dataset. The experimental results revealed that the transformer-based model, when applied directly to the Roman Urdu hate speech classification task, outperformed traditional machine learning models, deep learning models, and pre-trained transformer-based models in terms of accuracy, precision, recall, and F-measure, with scores of 96.70%, 97.25%, 96.74%, and 97.89%, respectively. In addition, the transformer-based model exhibited superior generalization on a cross-domain dataset.
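The evaluation metrics reported in the abstract (accuracy, precision, recall, and F-measure) all derive from the binary confusion matrix of a "hate" vs. "not hate" classifier. A minimal sketch of how they are computed (the counts below are illustrative, not results from the paper):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from binary
    confusion-matrix counts (treating "hate" as the positive class)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)          # of messages flagged as hate, how many were hate
    recall = tp / (tp + fn)             # of actual hate messages, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Illustrative counts for a hypothetical test set of 200 messages
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(f"acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
```

F-measure (F1) balances precision and recall, which matters here because hate speech is typically the minority class and raw accuracy alone can be misleading.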


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/169b/10143294/442697210708/sensors-23-03909-g001.jpg
