Positive Psychology Center, Department of Psychology, University of Pennsylvania, Philadelphia, PA, United States.
JMIR Public Health Surveill. 2015 Jun 26;1(1):e6. doi: 10.2196/publichealth.3953.
Twitter is increasingly used to estimate disease prevalence, but such measurements can be biased due to both biased sampling and the inherent ambiguity of natural language.
We characterized the extent of these biases and how they vary with disease.
We correlated self-reported prevalence rates for 22 diseases from Experian's Simmons National Consumer Study (n=12,305) with the number of times these diseases were mentioned on Twitter during the same period (2012). We also identified and corrected for two types of bias present in Twitter data: (1) demographic variance between US Twitter users and the general US population; and (2) natural language ambiguity, which creates the possibility that mention of a disease name may not actually refer to the disease (eg, "heart attack" on Twitter often does not refer to myocardial infarction). We measured the correlation between disease prevalence and Twitter disease mentions both with and without bias correction. This allowed us to quantify each disease's overrepresentation or underrepresentation on Twitter, relative to its prevalence.
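The demographic correction in (1) can be sketched as a simple reweighting: each demographic group's survey prevalence is weighted by that group's share of Twitter users rather than its share of the general population. The groups, shares, and prevalence rates below are illustrative placeholders, not the study's data.

```python
# Illustrative reweighting of survey prevalence toward Twitter's
# demographics. None of these numbers come from the study; they only
# demonstrate the mechanics of the adjustment.
groups = {
    # group: (share of US population, share of US Twitter users, prevalence %)
    "18-29": (0.22, 0.45, 3.0),
    "30-49": (0.35, 0.40, 8.0),
    "50+":   (0.43, 0.15, 20.0),
}

# Prevalence weighted by general-population shares (the usual estimate).
pop_prev = sum(pop_share * prev for pop_share, _, prev in groups.values())

# Prevalence reweighted by Twitter-user shares, i.e. the rate we would
# expect to see reflected in tweets if mentions tracked prevalence.
twitter_adj_prev = sum(tw_share * prev for _, tw_share, prev in groups.values())

print(f"general-population prevalence: {pop_prev:.2f}%")
print(f"Twitter-adjusted prevalence:   {twitter_adj_prev:.2f}%")
```

Because Twitter skews young, a disease concentrated in older groups (as in this toy example) gets a lower Twitter-adjusted prevalence than its general-population rate, which is the direction of correction the study applies.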
Our sample included 80,680,449 tweets. Adjusting disease prevalence to correct for Twitter demographics more than doubles the correlation between Twitter disease mentions and disease prevalence in the general population (from .113 to .258; P<.001). In addition, diseases varied widely in how often mentions of their names on Twitter actually referred to the diseases, from 14.89% (3827/25,704) of instances (for stroke) to 99.92% (5044/5048) of instances (for arthritis). Applying ambiguity correction to our Twitter corpus yields a correlation between disease mentions and prevalence of .208 (P<.001). Simultaneously correcting for both demographics and ambiguity more than triples the baseline correlation, to .366 (P<.001). Compared with prevalence rates, cancer appeared most overrepresented on Twitter, whereas high cholesterol appeared most underrepresented.
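The ambiguity correction and the correlation measurement can be sketched as follows: raw mention counts are multiplied by each disease's estimated precision (the fraction of mentions that truly refer to the disease), then correlated with survey prevalence via Pearson's r. Only the stroke and arthritis mention counts and precisions come from the results above; every other number is an illustrative placeholder.

```python
# Sketch of ambiguity correction followed by Pearson correlation.
# "precision" = estimated fraction of tweets containing a disease name
# that actually refer to the disease (14.89% for stroke and 99.92% for
# arthritis are from the study; the rest are made up for illustration).
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Per-disease data: raw mention counts, estimated precision, and
# self-reported prevalence (%). Prevalence values are illustrative.
diseases = {
    "stroke":    {"mentions": 25704, "precision": 0.1489, "prevalence": 2.9},
    "arthritis": {"mentions": 5048,  "precision": 0.9992, "prevalence": 22.0},
    "asthma":    {"mentions": 9000,  "precision": 0.90,   "prevalence": 8.0},
    "diabetes":  {"mentions": 12000, "precision": 0.85,   "prevalence": 9.0},
}

raw = [d["mentions"] for d in diseases.values()]
corrected = [d["mentions"] * d["precision"] for d in diseases.values()]
prevalence = [d["prevalence"] for d in diseases.values()]

r_raw = pearson(raw, prevalence)
r_corrected = pearson(corrected, prevalence)
print(f"r (raw mentions):          {r_raw:+.3f}")
print(f"r (ambiguity-corrected):   {r_corrected:+.3f}")
```

Discarding the mentions that do not refer to the disease (e.g. most "stroke" mentions) removes noise that otherwise inflates some diseases' apparent Twitter presence, which is why the corrected counts track prevalence more closely in the study.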
Twitter is a potentially useful tool for measuring public interest in, and concern about, different diseases, but cross-disease comparisons improve markedly when adjusted for population demographics and word ambiguity.