Machine Learning and Lexicon Approach to Texts Processing in the Detection of Degrees of Toxicity in Online Discussions.

Affiliation

Department of Cybernetics and Artificial Intelligence, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Letná 9, 04200 Košice, Slovakia.

Publication information

Sensors (Basel). 2022 Aug 27;22(17):6468. doi: 10.3390/s22176468.

Abstract

This article addresses the problem of detecting toxicity in online discussions. Toxicity is a serious problem now that people are strongly influenced by opinions expressed on social networks. We offer a solution based on machine learning classification models that assign short social network texts to multiple degrees of toxicity. The models use both classic machine learning methods, such as naïve Bayes and SVM (support vector machine), and ensemble methods, such as bagging and RF (random forest). The models were trained on text data in the Slovak language that we extracted from social networks. Our dataset of short texts was labelled automatically into multiple classes (the degrees of toxicity) by our method based on the lexicon approach to text processing. This lexicon method required creating a dictionary of toxic words in the Slovak language, which is another contribution of the work. Finally, an application was built on the trained models; it can be used to detect the degree of toxicity of new social network comments and to experiment with various machine learning methods. We achieved the best results using SVM, with an average accuracy of 0.89 and an F1 score of 0.79. This model also outperformed the RF and bagging ensemble methods; however, the ensemble methods achieved better results than the naïve Bayes method.
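The lexicon-based labelling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon entries (English placeholders instead of the Slovak dictionary), the weights, the thresholds, and the names `TOXIC_LEXICON`, `DEGREES`, `toxicity_score`, and `label_degree` are all assumptions chosen for illustration. The idea is only that each comment receives a score from the toxic words it contains, and the score is mapped to a degree of toxicity that can then serve as a training label.

```python
import re

# Hypothetical lexicon: word -> toxicity weight.
# Placeholder English entries, NOT the Slovak dictionary built in the paper.
TOXIC_LEXICON = {"idiot": 2, "stupid": 1, "hate": 2, "trash": 1}

# Hypothetical thresholds: (minimum score, degree), checked from highest to lowest.
DEGREES = [(3, "high"), (1, "low"), (0, "none")]


def toxicity_score(text, lexicon=TOXIC_LEXICON):
    """Sum the weights of all lexicon words that occur in the text."""
    # \w+ is Unicode-aware in Python 3, so words with diacritics stay whole.
    tokens = re.findall(r"\w+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def label_degree(text, lexicon=TOXIC_LEXICON, degrees=DEGREES):
    """Map a comment to the first degree whose threshold its score reaches."""
    score = toxicity_score(text, lexicon)
    for threshold, degree in degrees:
        if score >= threshold:
            return degree
    return "none"


# Automatically label a few example comments.
comments = ["You stupid idiot!", "That is stupid", "Nice weather today"]
labelled = [(comment, label_degree(comment)) for comment in comments]
```

Labels produced this way could then be paired with text features to train a classifier such as SVM or naïve Bayes; the feature extraction and model settings used in the paper are not given in the abstract, so they are left out here.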

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b220/9459955/3bc281424fd8/sensors-22-06468-g006.jpg
