Mostafa Mohamed A, Almogren Ahmad
Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
Chair of Cyber Security, Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
PeerJ Comput Sci. 2024 Oct 30;10:e2432. doi: 10.7717/peerj-cs.2432. eCollection 2024.
The proliferation of fake news on social media platforms calls for reliable datasets for fake news detection and veracity analysis. In this article, we introduce "VERA-ARAB", a pioneering large-scale veracity dataset designed to enhance fake news detection in Arabic tweets. VERA-ARAB is a balanced, multi-domain, and multi-dialectal dataset containing both fake and true news, verified by fact-checking experts from Misbar. Comprising approximately 20,000 tweets from 13,000 distinct users and covering 884 different claims, the dataset includes news text, user details, and spatiotemporal data, and spans diverse domains such as sports and politics. We used the X API to retrieve and structure the dataset, provided a comprehensive data dictionary describing the raw data, and conducted a descriptive statistical analysis whose patterns and distributions are visualized according to data type and nature. We also evaluated the dataset with multiple machine learning classification models, exploring various social and textual features. The results are promising, particularly with textual features, underscoring the dataset's potential for fake news detection. Furthermore, we outline future work aimed at expanding VERA-ARAB to establish it as a benchmark for fake news detection in Arabic tweets. We also discuss other applications that could use the dataset, including user veracity assessment, topic modeling, and named entity recognition, which demonstrate its utility for broader research in information quality management on social media.