Burton Scott H, Tanner Kesler W, Giraud-Carrier Christophe G, West Joshua H, Barnes Michael D
Brigham Young University, Computational Health Science Research Group, Department of Computer Science, Provo, UT 84602, USA.
J Med Internet Res. 2012 Nov 15;14(6):e156. doi: 10.2196/jmir.2121.
Twitter provides various types of location data, including exact Global Positioning System (GPS) coordinates, which could be used for infoveillance and infodemiology (ie, the study and monitoring of online health information), health communication, and interventions. Despite its potential, Twitter location information is not well understood or well documented, limiting its public health utility.
The objective of this study was to document and describe the various types of location information that are available in Twitter and that can be ascertained from its users. This information is key to informing future research on the availability, usability, and limitations of such location data.
Location data were gathered directly from Twitter using its application programming interface (API). The maximum number of tweets allowed by Twitter (1% of all tweets) was gathered over 2 separate weeks in October and November 2011. The final dataset consisted of 23.8 million tweets from 9.5 million unique users. Frequencies for each of the location options were calculated to determine the prevalence of the various location data options by region of the world, time zone, and state within the United States. Data from the US Census Bureau were also compiled to determine population proportions in each state, and Pearson correlation coefficients were used to compare each state's population with the number of Twitter users who enable the GPS location option.
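The state-level comparison described above can be illustrated with a minimal sketch of the Pearson correlation coefficient. The function name `pearson_r` and the two input lists are hypothetical, and the numbers are toy placeholders rather than data from the study.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy placeholder values (NOT from the study): one entry per state.
state_population = [1.0, 2.0, 3.0, 4.0]
gps_enabled_users = [10, 20, 30, 40]

r = pearson_r(state_population, gps_enabled_users)
```

In practice each list would hold one value per state, with census populations in one and counts of GPS-enabled users in the other.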
GPS location data could be ascertained for 2.02% of tweets and 2.70% of unique users. Using a simple text-matching approach, the user's city and state could be determined from 17.13% of user profiles in the 4 continental US time zones. Agreement between GPS data and data from the text-matching approach was high (87.69%). Furthermore, there was a significant correlation between the number of Twitter users per state and the 2010 US Census state populations (r ≥ 0.97, P < .001).
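The abstract does not specify the matching rules, so the following is only a sketch of what a simple text-matching step and an agreement check might look like. It assumes profile locations written as "City, State"; the lookup table, the regular expression, and the names `match_city_state` and `agreement_rate` are all hypothetical, and agreement is computed here at whatever granularity the caller supplies.

```python
import re

# Hypothetical, deliberately tiny lookup table; a real gazetteer would be far larger.
STATE_ABBREVS = {"UT": "Utah", "NY": "New York", "CA": "California"}
STATE_NAMES = {name.lower(): abbr for abbr, name in STATE_ABBREVS.items()}

# Assumed profile format: free-text city, a comma, then a state name or abbreviation.
CITY_STATE = re.compile(r"^\s*([A-Za-z .'-]+?)\s*,\s*([A-Za-z ]+?)\s*$")


def match_city_state(profile_location):
    """Return (city, state_abbreviation), or None if the text does not match."""
    m = CITY_STATE.match(profile_location or "")
    if not m:
        return None
    city, state = m.group(1).strip(), m.group(2).strip()
    if state.upper() in STATE_ABBREVS:
        return (city.title(), state.upper())
    abbr = STATE_NAMES.get(state.lower())
    if abbr:
        return (city.title(), abbr)
    return None


def agreement_rate(pairs):
    """Fraction of (text_location, gps_location) pairs that are equal."""
    if not pairs:
        raise ValueError("no users with both location sources")
    return sum(1 for text_loc, gps_loc in pairs if text_loc == gps_loc) / len(pairs)
```

For example, `match_city_state("Provo, UT")` yields `("Provo", "UT")`, while a free-form entry with no recognizable state yields `None` and would count as unlocated.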
Health researchers exploring ways to use Twitter data for disease surveillance should be aware that the majority of tweets are not currently associated with an identifiable geographic location. A straightforward text-matching process can identify location for approximately 4 times as many tweets as the GPS location information available in Twitter. Given the strong correlation between the two data-gathering methods, future research may consider using more qualitative approaches with higher yields, such as text mining, to acquire information about Twitter users' geographical location.