Detection of offensive terms in resource-poor language using machine learning algorithms.

Authors

Raza Muhammad Owais, Mahoto Naeem Ahmed, Hamdi Mohammed, Reshan Mana Saleh Al, Rajab Adel, Shaikh Asadullah

Affiliations

Department of Software Engineering, Mehran University of Engineering and Technology Jamshoro, Jamshoro, Pakistan.

Department of Computer Science, Najran University, Najran, Saudi Arabia.

Publication

PeerJ Comput Sci. 2023 Aug 29;9:e1524. doi: 10.7717/peerj-cs.1524. eCollection 2023.

DOI: 10.7717/peerj-cs.1524
PMID: 37705647
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10496005/
Abstract

The use of offensive terms in user-generated content is one of the major concerns for social media platforms. Offensive terms have a negative impact on individuals and may lead to the degradation of societal and civilised manners. The immense amount of content generated at high speed makes it humanly impossible to categorise and detect offensive terms manually, and detecting such terminology automatically remains an open challenge for natural language processing (NLP). Substantial efforts have been made for high-resource languages such as English; the task becomes more challenging for resource-poor languages such as Urdu, owing to the lack of standard datasets and pre-processing tools for automatic offensive-term detection. This paper introduces a combinatorial pre-processing approach for developing a classification model for cross-platform (Twitter and YouTube) use. The approach uses datasets from the two platforms for training and testing the model, which is built with decision tree, random forest and naive Bayes algorithms. The combinatorial pre-processing approach is applied to examine how machine learning models behave under different combinations of standard pre-processing techniques for a low-resource language in the cross-platform setting. The experimental results demonstrate the effectiveness of the machine learning models over different subsets of traditional pre-processing approaches in building a classification model for automatic offensive-term detection for a low-resource language, Urdu, in the cross-platform scenario. When dataset D1 is used for training and D2 for testing, stopword removal produced the best results, with an accuracy of 83.27%. Conversely, when D2 is used for training and D1 for testing, the combination of stopword removal and punctuation removal performed best, with an accuracy of 74.54%. The proposed combinatorial approach outperformed the benchmark on the considered datasets using classical as well as ensemble machine learning, with accuracies of 82.9% and 97.2% for datasets D1 and D2, respectively.
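The pipeline the abstract describes — enumerating subsets of standard pre-processing steps, then training decision tree, random forest and naive Bayes classifiers on one platform's dataset and testing on the other's — can be sketched as follows. This is an illustrative sketch, not the paper's code: the stopword list, step names and data are placeholders, and a real pipeline would use Urdu-specific stopword lists and tokenisation.

```python
# Sketch of a combinatorial pre-processing evaluation in a cross-platform
# setting (train on one platform's data, test on the other's).
# All names and data here are illustrative stand-ins, not the paper's.
from itertools import combinations
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder stopword list; an Urdu pipeline would use an Urdu list.
STOPWORDS = {"is", "a", "the", "and"}

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

# The two pre-processing steps highlighted in the abstract's results.
STEPS = {"stopword_removal": remove_stopwords,
         "punctuation_removal": remove_punctuation}

def preprocess(texts, step_names):
    """Apply the chosen subset of pre-processing steps in order."""
    for name in step_names:
        texts = [STEPS[name](t) for t in texts]
    return texts

def cross_platform_eval(train_texts, train_y, test_texts, test_y):
    """Train on one platform (e.g. D1/Twitter), test on the other (D2/YouTube),
    once per (pre-processing subset, classifier) pair."""
    results = {}
    for size in range(len(STEPS) + 1):
        for subset in combinations(STEPS, size):
            vec = CountVectorizer()
            X_tr = vec.fit_transform(preprocess(list(train_texts), subset))
            X_te = vec.transform(preprocess(list(test_texts), subset))
            for clf in (DecisionTreeClassifier(random_state=0),
                        RandomForestClassifier(random_state=0),
                        MultinomialNB()):
                clf.fit(X_tr, train_y)
                acc = accuracy_score(test_y, clf.predict(X_te))
                results[(subset, type(clf).__name__)] = acc
    return results
```

With the two steps above, the loop evaluates all four subsets (none, each step alone, both) against each of the three classifiers, mirroring how the paper compares accuracy across pre-processing combinations in each train/test direction.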


Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/e8864dade387/peerj-cs-09-1524-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/9fc8582e0088/peerj-cs-09-1524-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/e07649e1a40d/peerj-cs-09-1524-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/e6d409be0b4b/peerj-cs-09-1524-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/30525b35a552/peerj-cs-09-1524-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/94fa0705004e/peerj-cs-09-1524-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/ad82da9cef0e/peerj-cs-09-1524-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/da47ea077e01/peerj-cs-09-1524-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/e607c612826f/peerj-cs-09-1524-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/ef93051534d8/peerj-cs-09-1524-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/b87cf8622080/peerj-cs-09-1524-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/e805fcf91169/peerj-cs-09-1524-g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/a52a589bf9e9/peerj-cs-09-1524-g013.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/fd39812082fa/peerj-cs-09-1524-g014.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/2e7655adf0bd/peerj-cs-09-1524-g015.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2774/10496005/31679864cca0/peerj-cs-09-1524-g016.jpg

Similar Articles

1. Detection of offensive terms in resource-poor language using machine learning algorithms.
   PeerJ Comput Sci. 2023 Aug 29;9:e1524. doi: 10.7717/peerj-cs.1524. eCollection 2023.
2. Offensive language detection in low resource languages: A use case of Persian language.
   PLoS One. 2024 Jun 21;19(6):e0304166. doi: 10.1371/journal.pone.0304166. eCollection 2024.
3. Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT.
   PeerJ Comput Sci. 2023 Oct 18;9:e1617. doi: 10.7717/peerj-cs.1617. eCollection 2023.
4. Investigating cross-lingual training for offensive language detection.
   PeerJ Comput Sci. 2021 Jun 25;7:e559. doi: 10.7717/peerj-cs.559. eCollection 2021.
5. Roman Urdu Hate Speech Detection Using Transformer-Based Model for Cyber Security Applications.
   Sensors (Basel). 2023 Apr 12;23(8):3909. doi: 10.3390/s23083909.
6. Deep learning based sentiment analysis and offensive language identification on multilingual code-mixed data.
   Sci Rep. 2022 Dec 13;12(1):21557. doi: 10.1038/s41598-022-26092-3.
7. Identification of offensive language in Urdu using semantic and embedding models.
   PeerJ Comput Sci. 2022 Dec 12;8:e1169. doi: 10.7717/peerj-cs.1169. eCollection 2022.
8. Normalized effect size (NES): a novel feature selection model for Urdu fake news classification.
   PeerJ Comput Sci. 2023 Oct 24;9:e1612. doi: 10.7717/peerj-cs.1612. eCollection 2023.
9. The influence of preprocessing on text classification using a bag-of-words representation.
   PLoS One. 2020 May 1;15(5):e0232525. doi: 10.1371/journal.pone.0232525. eCollection 2020.
10. Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid.
    J Med Internet Res. 2021 May 3;23(5):e26616. doi: 10.2196/26616.

References Cited in This Article

1. Identification of offensive language in Urdu using semantic and embedding models.
   PeerJ Comput Sci. 2022 Dec 12;8:e1169. doi: 10.7717/peerj-cs.1169. eCollection 2022.