A normalization model for repeated letters in social media hate speech text based on rules and spelling correction.

Affiliations

Center for AI Technology (CAIT), FTSM, Universiti Kebangsaan Malaysia, UKM, Bangi, Malaysia.

Department of Computer Science, Ibb University, Ibb, Yemen.

Publication information

PLoS One. 2024 Mar 21;19(3):e0299652. doi: 10.1371/journal.pone.0299652. eCollection 2024.

Abstract

As social media use has grown, abusive online practices such as hate speech have unfortunately increased as well. Because letters are often repeated in the words used in social media messages, such words should be eliminated or reduced in number to enhance the efficacy of hate speech detection. Although multiple models have attempted to normalize out-of-vocabulary (OOV) words with repeated letters, they often fail to determine whether the in-vocabulary (IV) replacement words are correct. Therefore, this study developed an improved model for normalizing OOV words with repeated letters by replacing them with correct IV words. The improved normalization model is an unsupervised method that requires neither a special dictionary nor annotated data. It combines rule-based patterns for words with repeated letters with the SymSpell spelling correction algorithm. The rules remove repeated letters according to their position in the word (beginning, middle, or end) and the repetition pattern. Two hate speech datasets were then used to assess performance. The proposed normalization model decreased the percentage of OOV words to 8%. Its F1 score was also 9% and 13% higher than those of the models proposed by two extant studies. The proposed model therefore outperformed the benchmark studies in replacing OOV words with the correct IV words, and it improved the performance of the detection model. As such, suitable rule-based patterns can be combined with spelling correction to build a text normalization model that correctly replaces words with repeated letters, which in turn improves hate speech detection in text.
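The general idea of pairing repeated-letter rules with a spelling-correction fallback can be illustrated with a minimal sketch. It is not the authors' model: the toy vocabulary `VOCAB`, the helper names `candidates` and `normalize`, and the rule of cutting each run to two letters or one are all assumptions made for illustration, and Python's standard `difflib.get_close_matches` stands in for SymSpell so the example needs no third-party package.

```python
import re
from difflib import get_close_matches
from itertools import product

# Hypothetical toy vocabulary; a real system would load a full word list.
VOCAB = {"good", "so", "hate", "stupid", "cool", "you", "are"}

# Matches any run of two or more identical characters, e.g. "ooo" in "gooood".
RUN = re.compile(r"(.)\1+")


def candidates(word):
    """Yield variants of `word` with every repeated-letter run cut to 2 or 1."""
    parts = []
    pos = 0
    for m in RUN.finditer(word):
        parts.append([word[pos:m.start()]])  # text before the run, unchanged
        ch = m.group(1)
        parts.append([ch * 2, ch])           # keep a double letter, or a single
        pos = m.end()
    parts.append([word[pos:]])               # text after the last run
    for combo in product(*parts):
        yield "".join(combo)


def normalize(word, vocab=VOCAB):
    """Map a word with repeated letters to an in-vocabulary word if possible."""
    w = word.lower()
    if w in vocab:
        return w  # already in vocabulary, nothing to do
    # Rule step: try every reduced variant and accept the first IV match.
    for cand in candidates(w):
        if cand in vocab:
            return cand
    # Fallback step (stand-in for SymSpell): fuzzy-match the fully reduced form.
    reduced = RUN.sub(r"\1", w)
    match = get_close_matches(reduced, sorted(vocab), n=1, cutoff=0.8)
    return match[0] if match else w


if __name__ == "__main__":
    for token in ["goooood", "sooo", "stuuupidd", "stuuped", "xyz"]:
        print(token, "->", normalize(token))
```

Trying the double-letter variant before the single-letter one lets "goooood" resolve to "good" rather than "god". Words that match nothing in the vocabulary are returned unchanged, so the sketch never forces a replacement.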

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5a04/10956744/84ff0d4f0231/pone.0299652.g001.jpg
