Characterizing the Prevalence of Obesity Misinformation, Factual Content, Stigma, and Positivity on the Social Media Platform Reddit Between 2011 and 2019: Infodemiology Study.

Author Affiliations

Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, NH, United States.

Department of Epidemiology, Geisel School of Medicine at Dartmouth, Lebanon, NH, United States.

Publication Information

J Med Internet Res. 2022 Dec 30;24(12):e36729. doi: 10.2196/36729.

DOI:10.2196/36729

PMID:36583929

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9840103/

Abstract

BACKGROUND

Reddit is a popular social media platform that has faced scrutiny for inflammatory language against those with obesity, yet there has been no comprehensive analysis of its obesity-related content.

OBJECTIVE

We aimed to quantify the presence of 4 types of obesity-related content on Reddit (misinformation, facts, stigma, and positivity) and identify psycholinguistic features that may be enriched within each one.

METHODS

All sentences (N=764,179) containing "obese" or "obesity" from top-level comments (n=689,447) made on non-age-restricted subreddits (ie, smaller communities within Reddit) between 2011 and 2019 that contained one of a series of keywords were evaluated. Four types of common natural language processing features were extracted: bigram term frequency-inverse document frequency, word embeddings derived from Bidirectional Encoder Representations from Transformers, sentiment from the Valence Aware Dictionary for Sentiment Reasoning, and psycholinguistic features from the Linguistic Inquiry and Word Count Program. These features were used to train an Extreme Gradient Boosting machine learning classifier to label each sentence as 1 of the 4 content categories or other. Two-part hurdle models for semicontinuous data (which use logistic regression to assess the odds of a 0 result and linear regression for continuous data) were used to evaluate whether select psycholinguistic features presented differently in misinformation (compared with facts) or stigma (compared with positivity).
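
The two-part hurdle model described above can be illustrated with a minimal Python sketch. It assumes a single binary predictor (for example, misinformation versus facts). In that special case the logistic-regression slope reduces to the log odds ratio of a zero score, and the linear-regression slope reduces to the difference in mean non-zero scores, so no iterative fitting is needed. The function name `hurdle_binary_group` and the toy scores are hypothetical and are not taken from the study; the authors' models may have included additional terms.

```python
import math
from statistics import mean


def hurdle_binary_group(reference, comparison):
    """Two-part hurdle fit for one binary predictor (comparison vs reference).

    Assumes non-negative scores and that each group contains both zero and
    non-zero values; otherwise the odds below are undefined.
    """
    def split(scores):
        zeros = sum(1 for s in scores if s == 0)
        positives = [s for s in scores if s > 0]
        return zeros, positives

    ref_zeros, ref_pos = split(reference)
    cmp_zeros, cmp_pos = split(comparison)

    # Part 1 (logistic): odds of a zero score in each group, then their ratio.
    ref_odds = ref_zeros / len(ref_pos)
    cmp_odds = cmp_zeros / len(cmp_pos)
    odds_ratio_zero = cmp_odds / ref_odds

    # Part 2 (linear): difference in means, using non-zero scores only.
    beta_nonzero = mean(cmp_pos) - mean(ref_pos)

    return {
        "odds_ratio_zero": odds_ratio_zero,
        "log_odds_ratio_zero": math.log(odds_ratio_zero),
        "beta_nonzero": beta_nonzero,
    }


# Invented toy data: percent of words in a sentence that are negations.
facts_scores = [0, 0, 0, 0, 0, 0, 2.0, 4.0]
misinformation_scores = [0, 0, 5.0, 6.0, 7.0, 10.0]

result = hurdle_binary_group(facts_scores, misinformation_scores)
print(result)
```

In this toy example, an odds ratio below 1 means the feature is less likely to be absent in the comparison group, and a positive `beta_nonzero` means that, when present, it makes up a larger share of the sentence.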

RESULTS

After removing ambiguous sentences, 0.47% (3610/764,179) of the sentences were labeled as misinformation, 1.88% (14,366/764,179) were labeled as stigma, 1.94% (14,799/764,179) were labeled as positivity, and 8.93% (68,276/764,179) were labeled as facts. Each category had markers that distinguished it from other categories within the data as well as an external corpus. For example, misinformation had a higher average percent of negations (β=3.71, 95% CI 3.53-3.90; P<.001) but a lower average number of words >6 letters (β=-1.47, 95% CI -1.85 to -1.10; P<.001) relative to facts. Stigma had a higher proportion of swear words (β=1.83, 95% CI 1.62-2.04; P<.001) but a lower proportion of first-person singular pronouns (β=-5.30, 95% CI -5.44 to -5.16; P<.001) relative to positivity.
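
The reported percentages follow directly from the category counts and the total number of sentences; the short Python check below recomputes them. The variable names are illustrative. The remainder is simply whatever the 4 counts leave over (presumably sentences labeled as other or removed as ambiguous), which the abstract does not break down further.

```python
TOTAL_SENTENCES = 764_179

# Counts as reported in the Results section.
counts = {
    "misinformation": 3_610,
    "stigma": 14_366,
    "positivity": 14_799,
    "facts": 68_276,
}

# Share of all sentences that falls in each category, in percent.
percentages = {label: 100 * n / TOTAL_SENTENCES for label, n in counts.items()}
for label, pct in percentages.items():
    print(f"{label}: {pct:.2f}%")

# Sentences not assigned to any of the 4 categories.
remainder = TOTAL_SENTENCES - sum(counts.values())
print(f"not in any of the 4 categories: {remainder} "
      f"({100 * remainder / TOTAL_SENTENCES:.2f}%)")
```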

CONCLUSIONS

There are distinct psycholinguistic properties between types of obesity-related content on Reddit that can be leveraged to rapidly identify deleterious content with minimal human intervention and provide insights into how the Reddit population perceives patients with obesity. Future work should assess whether these properties are shared across languages and other social media platforms.
