Evolution & Ecology Research Centre, and School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney NSW 2052, Australia.
Department of Infectious Disease Epidemiology, Imperial College London, Faculty of Medicine, Norfolk Place, London W2 1PG, UK.
Gigascience. 2019 May 1;8(5). doi: 10.1093/gigascience/giz035.
The sharing and re-use of data has become a cornerstone of modern science. Multiple platforms now allow easy publication of datasets. So far, however, platforms for data sharing offer limited functions for distributing and interacting with evolving datasets: those that continue to grow over time as more records are added, errors are fixed, and new data structures are created. In this article, we describe a workflow for maintaining and distributing successive versions of an evolving dataset, allowing users to retrieve and load different versions directly into the R platform. Our workflow uses the tools and platforms employed to develop and distribute successive versions of an open source software program, including version control, GitHub, and semantic versioning, and applies them to the analogous process of developing successive versions of an open source dataset. Moreover, we argue that this model allows individual research groups to achieve a dynamic and versioned model of data delivery at no cost.
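To make the idea of semantic versioning for a dataset concrete, the following is a minimal sketch in Python, not the authors' implementation (which targets R). It assumes one plausible mapping from the kinds of change named in the abstract to version components: new data structures bump the major number, added records bump the minor number, and error fixes bump the patch number. The names `Version`, `bump`, `latest`, and the change labels are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A semantic version: major.minor.patch, compared numerically."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Accept tags such as "1.2.3" or "v1.2.3".
        major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
        return cls(major, minor, patch)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def bump(version: Version, change: str) -> Version:
    """Return the next version for a given kind of dataset change.

    The mapping below is an assumption made for this sketch:
      "structure" -> major (new or changed data structures)
      "records"   -> minor (more records added)
      "fix"       -> patch (errors corrected)
    """
    if change == "structure":
        return Version(version.major + 1, 0, 0)
    if change == "records":
        return Version(version.major, version.minor + 1, 0)
    if change == "fix":
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown change type: {change!r}")


def latest(tags: list[str]) -> str:
    """Pick the most recent release from a list of version tags."""
    return str(max(Version.parse(tag) for tag in tags))


if __name__ == "__main__":
    current = Version.parse("1.2.3")
    print(bump(current, "records"))
    print(latest(["1.9.2", "1.10.0", "0.12.5"]))
```

Parsing the tags into integer tuples before comparing them matters: a plain string comparison would rank "1.9.2" above "1.10.0", whereas numeric comparison selects "1.10.0" as the newer release.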