Comprehensive dataset of user-submitted articles with ideological and extreme bias from Reddit.

Authors

Ravi Kamalakkannan, Vela Adan Ernesto

Affiliation

University of Central Florida, Orlando, USA.

Publication

Data Brief. 2024 Aug 22;56:110849. doi: 10.1016/j.dib.2024.110849. eCollection 2024 Oct.

Abstract

Our study aims to collect data to understand ideological and extreme bias in text articles shared across various online communities, particularly focusing on the language used in subreddits associated with extremism and targeted violence. Initially, we gathered data from related online communities, specifically the r/Liberal and r/Conservative communities on Reddit, utilizing the Reddit Pushshift API to collect URLs shared within these subreddits. Our aim was to gather news, opinion, and feature articles, resulting in a corpus of 226,010 articles. We also curated a balanced subset of 45,108 articles and annotated 4000 articles to validate their relevance, facilitating understanding of language usage within ideological Reddit communities and insights into ideological bias in media content. Expanding beyond binary ideologies, we introduced a new category termed "Restricted" to encompass articles shared in private or banned subreddits. This third category encompasses articles shared in restricted, privatized, quarantined, or banned subreddits characterized by radicalized and extremist ideologies. This expansion yielded a large dataset of 377,144 articles. Additionally, we included articles from subreddits with unspecified ideologies, creating a holdout set of 922,522 articles. In total, our combined dataset of 1.3 million articles collected from 55 different subreddits will assist in examining radicalized communities and providing discourse analysis in associated subreddits, enhancing understanding of the language used in articles shared within radicalized Reddit communities and offering insights into extreme bias in media content. In summary, we collected 1.52 million articles to understand ideological and extreme bias, providing a comprehensive dataset that aids in understanding language usage within text articles posted in ideological and extreme Reddit communities.
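The counts quoted in the abstract can be cross-checked with a short tally. The sketch below uses only the figures stated above; the dictionary keys are hypothetical labels chosen for illustration, and reading the "1.3 million" figure as the Restricted set plus the holdout set is an assumption, since the abstract does not spell out which subsets it combines.

```python
# Article counts as reported in the abstract.
# The key names are illustrative labels, not identifiers from the dataset.
counts = {
    "liberal_conservative": 226_010,  # r/Liberal and r/Conservative corpus
    "restricted": 377_144,            # restricted, private, quarantined, or banned subreddits
    "holdout": 922_522,               # subreddits with unspecified ideology
}

# Assumption: the "1.3 million" figure combines the Restricted and holdout sets.
expanded = counts["restricted"] + counts["holdout"]

# Sum of all three subsets, to compare with the "1.52 million" figure.
total = sum(counts.values())

print(f"Restricted + holdout: {expanded:,}")  # 1,299,666
print(f"All three subsets:    {total:,}")     # 1,525,676
```

Under that reading, both rounded figures in the abstract agree with the per-subset counts.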
