Brown Nicholas J L, Coyne James C
University Medical Center, University of Groningen, Groningen, Netherlands.
PeerJ. 2018 Sep 21;6:e5656. doi: 10.7717/peerj.5656. eCollection 2018.
We comment on Eichstaedt et al.'s (2015a) claim to have shown that language patterns among Twitter users, aggregated at the level of US counties, predicted county-level mortality rates from atherosclerotic heart disease (AHD), with "negative" language being associated with higher rates of death from AHD and "positive" language associated with lower rates. First, we examine some of Eichstaedt et al.'s apparent assumptions about the nature of AHD, as well as some issues related to the secondary analysis of online data and to considering counties as communities. Next, using the data files supplied by Eichstaedt et al., we reproduce their regression- and correlation-based models, substituting mortality from an alternative cause of death, namely suicide, as the outcome variable, and observe that the purported associations between "negative" and "positive" language and mortality are reversed when suicide is used as the outcome variable. We identify numerous other conceptual and methodological limitations that call into question the robustness and generalizability of Eichstaedt et al.'s claims, even when these are based on the results of their ridge regression/machine learning model. We conclude that there is no good evidence that analyzing Twitter data in bulk in this way can add anything useful to our ability to understand geographical variation in AHD mortality rates.