Lan Hai, Sha Dexuan, Malarvizhi Anusha Srirenganathan, Liu Yi, Li Yun, Meister Nadine, Liu Qian, Wang Zifu, Yang Jingchao, Yang Chaowei Phil
NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA 22030, USA.
Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA 22030, USA.
IEEE Access. 2021 Jun 3;9:84783-84798. doi: 10.1109/ACCESS.2021.3085682. eCollection 2021.
After emerging in 2019, COVID-19 quickly spread across the world, infecting millions of people and disrupting the normal lives of citizens in every country. Governments, organizations, and research institutions all over the world are dedicating vast resources to finding effective strategies against this rapidly propagating virus. As testing has expanded, most countries routinely publish the numbers of confirmed cases, deaths, and recoveries, along with their locations, through various channels and in various forms. This important data source has enabled researchers worldwide to carry out a range of COVID-19 studies, such as modeling the virus's spreading patterns, developing prevention strategies, and studying the impact of COVID-19 on other aspects of society. However, one major challenge is that there is no standardized, up-to-date, high-quality data product that covers COVID-19 case data internationally. Different countries publish their data through different channels, in different formats, and at different time intervals, which hinders researchers from fetching the COVID-19 datasets they need, especially for fine-scale studies. Although existing solutions such as the Johns Hopkins COVID-19 Dashboard and the 1point3acres COVID-19 tracker are widely used, it is difficult for users to access their original datasets and customize the data to meet specific requirements for categories, data structure, and data source selection. To address this challenge, we developed a toolset that uses cloud-based web scraping to automatically extract, refine, unify, and store COVID-19 case data at multiple scales for all available countries around the world. The toolset then publishes the data for public access, offering users a real-time, dynamic COVID-19 dataset with a global view. Two case studies illustrate how the datasets can be used. Because the toolset is open source, it can also be easily extended to serve other purposes.
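To make the "refine and unify" step concrete, the following is a minimal sketch of how rows scraped from differently formatted national sources might be mapped onto one shared schema. It is an illustration only, not the toolset's actual code: the `CaseRecord` schema, the `unify` function, the field maps, and the sample rows are all hypothetical, and the numbers are invented.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class CaseRecord:
    """Hypothetical unified schema: one region on one day."""
    country: str
    region: str
    day: date
    confirmed: Optional[int]
    deaths: Optional[int]
    recovered: Optional[int]


def to_count(value) -> Optional[int]:
    """Refine a raw count such as '1,234' into an int; blank or missing becomes None."""
    if value is None:
        return None
    text = str(value).replace(",", "").strip()
    return int(text) if text else None


def unify(raw: dict, country: str, field_map: dict, date_format: str) -> CaseRecord:
    """Map one source-specific row onto the shared schema.

    field_map translates unified field names to the source's own column names.
    """
    def get(key):
        source_key = field_map.get(key)
        return raw.get(source_key) if source_key else None

    return CaseRecord(
        country=country,
        region=str(get("region")).strip(),
        day=datetime.strptime(get("day"), date_format).date(),
        confirmed=to_count(get("confirmed")),
        deaths=to_count(get("deaths")),
        recovered=to_count(get("recovered")),
    )


# Two invented rows in different source formats (all values are made up).
row_a = {"Province": " Region A ", "Date": "2020-03-01",
         "Cases": "1,234", "Deaths": "56", "Cured": "789"}
map_a = {"region": "Province", "day": "Date", "confirmed": "Cases",
         "deaths": "Deaths", "recovered": "Cured"}

row_b = {"state": "Region B", "reported": "01/03/2020", "positive": 42, "dead": ""}
map_b = {"region": "state", "day": "reported", "confirmed": "positive",
         "deaths": "dead"}  # this source publishes no recovered count

records = [
    unify(row_a, "Country A", map_a, "%Y-%m-%d"),
    unify(row_b, "Country B", map_b, "%d/%m/%Y"),
]
```

Keeping missing counts as `None` rather than `0` is a deliberate choice in this sketch, so that a source that does not report recoveries is not mistaken for one reporting zero recoveries.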