

Multi-Ideology, Multiclass Online Extremism Dataset, and Its Evaluation Using Machine Learning.

Affiliations

Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, MH 412115, India.

MIT Art, Design and Technology University, Pune, MH 412201, India.

Publication information

Comput Intell Neurosci. 2023 Mar 1;2023:4563145. doi: 10.1155/2023/4563145. eCollection 2023.

Abstract

Social media platforms play a key role in extending the reach of extremism by influencing people's views, opinions, and perceptions. Extremist elements increasingly exploit these platforms to spread propaganda and to radicalize and recruit youth. Research on extremism detection on social media is therefore essential to curb its influence and ill effects. A study of the existing literature on extremism detection reveals three limitations: it is restricted to a specific ideology, it relies on binary classification that offers limited insight into extremism text, and it uses manual data validation methods to check data quality. Because existing studies use datasets limited to a single ideology, they face serious issues such as class imbalance, limited insight from class labels, and a lack of automated data validation methods. A major contribution of this work is a balanced extremism text dataset that covers multiple ideologies and is verified by robust data validation methods, for classifying extremism text into three popular extremism types. The presented dataset generalizes across ideologies by combining the standard ISIS dataset, the GAB White Supremacist dataset, and recent Twitter tweets on ISIS and white supremacist ideology. The dataset is analyzed to extract features for the three classes using TF-IDF unigram, bigram, and trigram features. Additionally, pretrained word2vec features are used for semantic analysis. The extracted features are evaluated using machine learning classification algorithms. The best results were achieved by a support vector machine using the TF-IDF unigram model, with an F1 score of 0.67. The proposed multi-ideology, multiclass dataset shows performance comparable to that of existing datasets limited to a single ideology and binary labels.
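As a rough illustration of the feature extraction and scoring mentioned in the abstract, the sketch below computes TF-IDF weights over word n-grams and a macro-averaged F1 score using only the Python standard library. It is not the authors' pipeline. The abstract does not state the tokenization, the TF-IDF variant, or the F1 averaging scheme, so the whitespace tokenizer, the smoothed weighting ln((1 + N) / (1 + df)) + 1, the macro average, and the function names `ngrams`, `tfidf`, and `macro_f1` are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Return one {n-gram: weight} dict per document.

    Assumed weighting (not taken from the paper): raw term count times
    the smoothed inverse document frequency ln((1 + N) / (1 + df)) + 1.
    """
    counts = [Counter(ngrams(doc.lower().split(), n)) for doc in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    idf = {g: math.log((1 + n_docs) / (1 + k)) + 1 for g, k in df.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen (an assumed choice)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in pairs)
        fp = sum(t != lab and p == lab for t, p in pairs)
        fn = sum(t == lab and p != lab for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Neutral placeholder texts; the real dataset is not reproduced here.
    docs = ["alpha beta gamma", "alpha beta delta"]
    for n in (1, 2, 3):  # unigram, bigram, and trigram features
        print(n, tfidf(docs, n=n))
```

The resulting sparse dictionaries could be passed to any classifier; a support vector machine, the best performer reported in the abstract, would typically come from an external library rather than being written by hand.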


