Raza Muhammad Owais, Mahoto Naeem Ahmed, Hamdi Mohammed, Reshan Mana Saleh Al, Rajab Adel, Shaikh Asadullah
Department of Software Engineering, Mehran University of Engineering and Technology Jamshoro, Jamshoro, Pakistan.
Department of Computer Science, Najran University, Najran, Najran, Saudi Arabia.
PeerJ Comput Sci. 2023 Aug 29;9:e1524. doi: 10.7717/peerj-cs.1524. eCollection 2023.
The use of offensive terms in user-generated content on different social media platforms is one of the major concerns for these platforms. Offensive terms have a negative impact on individuals and may contribute to the degradation of societal and civilized manners. The immense amount of content generated at high speed makes it humanly impossible to categorise and detect offensive terms, and detecting such terminology automatically remains an open challenge for natural language processing (NLP). Substantial efforts have been made for high-resource languages such as English. However, the task becomes more challenging for resource-poor languages such as Urdu, owing to the lack of standard datasets and pre-processing tools for automatic offensive term detection. This paper introduces a combinatorial pre-processing approach for developing a classification model for cross-platform (Twitter and YouTube) use. The approach uses datasets from two different platforms (Twitter and YouTube) for training and testing the model, which is built using decision tree, random forest, and naive Bayes algorithms. The proposed combinatorial pre-processing approach is applied to examine how machine learning models behave with different combinations of standard pre-processing techniques for a low-resource language in the cross-platform setting. The experimental results demonstrate the effectiveness of the machine learning model over different subsets of traditional pre-processing approaches in building a classification model for automatic offensive term detection for a low-resource language, Urdu, in the cross-platform scenario. In the experiments, when dataset D1 is used for training and D2 for testing, the pre-processing approach of stopword removal produced better results, with an accuracy of 83.27%.
Conversely, when dataset D2 is used for training and D1 for testing, the combination of stopword removal and punctuation removal was observed to be the better pre-processing approach, with an accuracy of 74.54%. The combinatorial approach proposed in this paper outperformed the benchmark for the considered datasets using classical as well as ensemble machine learning, with accuracies of 82.9% and 97.2% for datasets D1 and D2, respectively.
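The cross-platform evaluation described above can be sketched as a small scikit-learn loop: train on one platform's dataset, test on the other, and repeat for every combination of pre-processing steps. This is only an illustrative sketch of the general technique, not the paper's actual pipeline; the toy corpora, the stand-in stopword list, and helper names like `preprocess` are hypothetical placeholders.

```python
# Illustrative sketch of a combinatorial pre-processing evaluation in a
# cross-platform setting. All data and helper names below are placeholders,
# not the paper's actual datasets or code.
import string
from itertools import combinations

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Placeholder corpora standing in for D1 (Twitter) and D2 (YouTube);
# labels: 0 = not offensive, 1 = offensive.
d1_texts = ["you are great", "you are awful trash", "nice video", "awful trash comment"]
d1_labels = [0, 1, 0, 1]
d2_texts = ["great nice clip", "trash awful reply"]
d2_labels = [0, 1]

STOPWORDS = {"you", "are"}  # stand-in for an Urdu stopword list

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

STEPS = {"stopwords": remove_stopwords, "punctuation": remove_punctuation}

def preprocess(texts, step_names):
    # Apply the chosen pre-processing steps in order to every document.
    for name in step_names:
        texts = [STEPS[name](t) for t in texts]
    return texts

results = {}
# The "combinatorial" part: try every non-empty subset of pre-processing steps,
# training on D1 and testing on D2 each time.
for r in range(1, len(STEPS) + 1):
    for combo in combinations(STEPS, r):
        train_x = preprocess(d1_texts, combo)
        test_x = preprocess(d2_texts, combo)
        vec = CountVectorizer()
        clf = MultinomialNB()
        clf.fit(vec.fit_transform(train_x), d1_labels)
        preds = clf.predict(vec.transform(test_x))
        results[combo] = accuracy_score(d2_labels, preds)

best_combo = max(results, key=results.get)
```

Swapping `MultinomialNB` for `DecisionTreeClassifier` or `RandomForestClassifier`, and reversing the train/test roles of D1 and D2, covers the other configurations the abstract reports.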