Shah Zubair, Martin Paige, Coiera Enrico, Mandl Kenneth D, Dunn Adam G
Centre for Health Informatics, Australian Institute for Health Innovation, Macquarie University, Sydney, Australia.
Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States.
J Med Internet Res. 2019 May 8;21(5):e12881. doi: 10.2196/12881.
Studies examining how sentiment on social media varies depending on timing and location appear to produce inconsistent results, making it hard to design systems that use sentiment to detect localized events for public health applications.
The aim of this study was to measure how common timing and location confounders explain variation in sentiment on Twitter.
Using a dataset of 16.54 million English-language tweets from 100 cities posted between July 13 and November 30, 2017, we estimated the positive and negative sentiment for each of the cities using a dictionary-based sentiment analysis. We then constructed models to explain the differences in sentiment using time of day, day of week, weather, city, and interaction type (conversations or broadcasting) as factors. All factors were independently associated with sentiment.
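As a rough illustration of dictionary-based sentiment scoring aggregated by a factor such as city or hour, the following minimal sketch may help. The word lists, the field names (`text`, `city`), and the scoring rule (share of tokens found in each list) are assumptions made for illustration; the abstract does not specify the dictionary or the scoring rule used in the study.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicons; the study's actual dictionary is not given here.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "sad", "hate", "awful"}


def score_tweet(text):
    """Return (positive, negative) as the share of tokens found in each lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0, 0.0
    pos = sum(t in POSITIVE for t in tokens) / len(tokens)
    neg = sum(t in NEGATIVE for t in tokens) / len(tokens)
    return pos, neg


def mean_sentiment_by(tweets, key):
    """Average positive and negative scores grouped by a factor (e.g. 'city')."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [pos total, neg total, count]
    for tweet in tweets:
        pos, neg = score_tweet(tweet["text"])
        acc = sums[tweet[key]]
        acc[0] += pos
        acc[1] += neg
        acc[2] += 1
    return {k: (p / n, q / n) for k, (p, q, n) in sums.items()}
```

Calling `mean_sentiment_by(tweets, "city")` on a list of dictionaries with `text` and `city` fields would give one (positive, negative) pair per city, which is the kind of per-factor baseline the models described above try to explain.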
In the full multivariable models of positive (Pearson r in test data 0.236; 95% CI 0.231-0.241) and negative (Pearson r in test data 0.306; 95% CI 0.301-0.310) sentiment, the city and time of day explained more of the variance than weather and day of week. Models that account for these confounders produce a different distribution and ranking of important events than models that do not.
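For readers unfamiliar with how a Pearson correlation between predicted and observed values can be paired with a confidence interval, here is a minimal sketch. It uses the Fisher z-transformation, which is one common approach; the abstract does not say how the study's intervals were computed, so this is an assumption and not the authors' method.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r from n pairs via the Fisher z-transformation."""
    z = math.atanh(r)               # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
```

With very large test sets the interval becomes narrow, which is consistent with the tight intervals reported above, although the exact values depend on the sample size and method actually used.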
In public health applications that aim to detect localized events by aggregating sentiment across populations of Twitter users, it is worthwhile accounting for baseline differences before looking for unexpected changes.
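The idea of accounting for baseline differences before looking for unexpected changes can be sketched as follows. This is a simplified illustration, not the study's multivariable model: it assumes a per-(city, hour) mean as the baseline, hypothetical field names (`city`, `hour`, `sentiment`), and an arbitrary threshold of two standard deviations of the residuals.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_unexpected(observations, threshold=2.0):
    """Flag observations whose sentiment deviates strongly from its baseline.

    The baseline here is simply the mean sentiment for the same city and hour
    (an assumption for illustration). Residuals are scaled by their overall
    standard deviation and compared with `threshold`.
    """
    groups = defaultdict(list)
    for obs in observations:
        groups[(obs["city"], obs["hour"])].append(obs["sentiment"])
    baseline = {key: mean(values) for key, values in groups.items()}

    residuals = [
        obs["sentiment"] - baseline[(obs["city"], obs["hour"])]
        for obs in observations
    ]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # no variation around the baseline, nothing to flag
    return [
        obs
        for obs, resid in zip(observations, residuals)
        if abs(resid) / spread > threshold
    ]
```

Comparing residuals instead of raw scores means a city or hour that is routinely more negative is not mistaken for an event; only departures from its own usual level are flagged.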