A normalization model for repeated letters in social media hate speech text based on rules and spelling correction.

Affiliations

Center for AI Technology (CAIT), FTSM, Universiti Kebangsaan Malaysia, UKM, Bangi, Malaysia.

Department of Computer Science, Ibb University, Ibb, Yemen.

Publication Information

PLoS One. 2024 Mar 21;19(3):e0299652. doi: 10.1371/journal.pone.0299652. eCollection 2024.

DOI: 10.1371/journal.pone.0299652
PMID: 38512966
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10956744/
Abstract

As social media booms, abusive online practices such as hate speech have unfortunately increased as well. Because words in social media messages often contain repeated letters, such words should be normalized to enhance the efficacy of hate speech detection. Although multiple models have attempted to normalize out-of-vocabulary (OOV) words with repeated letters, they often fail to determine whether the in-vocabulary (IV) replacement words are correct or incorrect. Therefore, this study developed an improved model for normalizing OOV words with repeated letters by replacing them with correct IV replacement words. The improved normalization model is an unsupervised method that does not require a special dictionary or annotated data. It combines rule-based patterns of words with repeated letters and the SymSpell spelling-correction algorithm to remove repeated letters within words, using multiple rules based on the position of the repeated letters within a word (beginning, middle, or end) and on the repetition pattern. Two hate speech datasets were then used to assess performance. The proposed normalization model decreased the percentage of OOV words to 8%, and its F1 score was 9% and 13% higher than the models proposed by two extant studies. The proposed normalization model therefore outperformed the benchmark studies in replacing OOV words with the correct IV replacement and improved the performance of the detection model. As such, suitable rule-based patterns can be combined with spelling correction to develop a text normalization model that correctly replaces words with repeated letters, which would, in turn, improve hate speech detection in texts.
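The core idea described in the abstract, collapsing runs of repeated letters according to rule-based patterns and keeping only candidates that are valid in-vocabulary words, can be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: the paper validates candidates with the SymSpell spelling-correction algorithm, whereas here a plain lookup against a tiny stand-in vocabulary takes its place.

```python
import itertools
import re

# Tiny stand-in vocabulary; the paper uses SymSpell's dictionary instead.
VOCAB = {"cool", "so", "good", "soon"}

def normalize(word, vocab=VOCAB):
    """Collapse runs of 3+ repeated letters and return the first
    candidate found in the vocabulary (the IV replacement word)."""
    # Split the word into runs of identical letters, e.g. "cooooool" -> c, oooooo, l.
    runs = [m.group(0) for m in re.finditer(r"(.)\1*", word)]
    # For each long run, try keeping two letters or one; short runs stay as-is.
    options = [[r] if len(r) < 3 else [r[0] * 2, r[0]] for r in runs]
    for combo in itertools.product(*options):
        candidate = "".join(combo)
        if candidate in vocab:
            return candidate
    # Fallback when no candidate is in-vocabulary: collapse every long run to one letter.
    return "".join(r if len(r) < 3 else r[0] for r in runs)

print(normalize("cooooool"))   # "cool"
print(normalize("goooodddd"))  # "good"
```

Note how the dictionary check matters: collapsing "goooodddd" blindly to single letters would give "god", while testing the 1-or-2-letter combinations against the vocabulary recovers the intended "good". The paper's positional rules (run at the beginning, middle, or end of the word) refine which collapse is attempted first.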

Figures 1–21 of the article (pone.0299652.g001–g021) are available as images via the PMC article page.

Similar Articles

1. A normalization model for repeated letters in social media hate speech text based on rules and spelling correction.
   PLoS One. 2024 Mar 21;19(3):e0299652. doi: 10.1371/journal.pone.0299652. eCollection 2024.
2. Emphasizing unseen words: New vocabulary acquisition for end-to-end speech recognition.
   Neural Netw. 2023 Apr;161:494-504. doi: 10.1016/j.neunet.2023.01.027. Epub 2023 Feb 10.
3. A curated dataset for hate speech detection on social media text.
   Data Brief. 2022 Dec 17;46:108832. doi: 10.1016/j.dib.2022.108832. eCollection 2023 Feb.
4. Linguistic Patterns for Code Word Resilient Hate Speech Identification.
   Sensors (Basel). 2021 Nov 25;21(23):7859. doi: 10.3390/s21237859.
5. Hate speech detection and racial bias mitigation in social media based on BERT model.
   PLoS One. 2020 Aug 27;15(8):e0237861. doi: 10.1371/journal.pone.0237861. eCollection 2020.
6. Hate speech and abusive language detection in Indonesian social media: Progress and challenges.
   Heliyon. 2023 Jul 28;9(8):e18647. doi: 10.1016/j.heliyon.2023.e18647. eCollection 2023 Aug.
7. Moralized language predicts hate speech on social media.
   PNAS Nexus. 2022 Dec 7;2(1):pgac281. doi: 10.1093/pnasnexus/pgac281. eCollection 2023 Jan.
8. Code-mixing unveiled: Enhancing the hate speech detection in Arabic dialect tweets using machine learning models.
   PLoS One. 2024 Jul 17;19(7):e0305657. doi: 10.1371/journal.pone.0305657. eCollection 2024.
9. Detection of Hate Speech in COVID-19-Related Tweets in the Arab Region: Deep Learning and Topic Modeling Approach.
   J Med Internet Res. 2020 Dec 8;22(12):e22609. doi: 10.2196/22609.
10. Roman Urdu Hate Speech Detection Using Transformer-Based Model for Cyber Security Applications.
    Sensors (Basel). 2023 Apr 12;23(8):3909. doi: 10.3390/s23083909.

References Cited in This Article

1. Offline events and online hate.
   PLoS One. 2023 Jan 25;18(1):e0278511. doi: 10.1371/journal.pone.0278511. eCollection 2023.
2. Abusive Language Detection in Online Conversations by Combining Content- and Graph-Based Features.
   Front Big Data. 2019 Jun 4;2:8. doi: 10.3389/fdata.2019.00008. eCollection 2019.
3. Hate speech detection: Challenges and solutions.
   PLoS One. 2019 Aug 20;14(8):e0221152. doi: 10.1371/journal.pone.0221152. eCollection 2019.
4. Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts.
   Comput Intell Neurosci. 2016;2016:1638936. doi: 10.1155/2016/1638936. Epub 2016 Oct 3.