Olsson Tjelvar S G, Hartley Matthew
Computational Systems Biology, John Innes Centre, Norwich, United Kingdom.
PeerJ. 2019 Mar 7;7:e6562. doi: 10.7717/peerj.6562. eCollection 2019.
The explosion in the volume and variety of data has created substantial data management challenges. These challenges often fall on front-line researchers, who are already dealing with rapidly changing technologies and have limited time to devote to data management. Good high-level guidelines exist for managing and processing scientific data, but simple, practical tools for implementing them are lacking. This gap is particularly problematic in highly distributed research environments, where needs differ substantially from group to group, centralised solutions are difficult to implement, and storage technologies change rapidly. To meet these challenges we have developed dtool, a command line tool for managing data. The tool packages data and metadata into a unified whole, which we call a dataset. A dataset supports consistency checking and gives access to metadata for both the dataset as a whole and its individual files. The tool can store datasets on several different storage systems, including traditional file systems, object storage (S3 and Azure) and iRODS. It also includes an application programming interface that can be used to incorporate it into existing pipelines and workflows. The tool has brought substantial process, cost, and peace-of-mind benefits to our data management practices, and we want to share these benefits. It is open source and freely available online at http://dtool.readthedocs.io.
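The idea of packaging data with metadata so that a dataset can be checked for consistency can be illustrated with a minimal sketch. This is not dtool's API or on-disk format. It uses only the Python standard library, and the function names (`build_manifest`, `verify`, `demo`) and manifest keys are hypothetical, chosen for illustration. It assumes a dataset is a directory of files described by per-file sizes and SHA-256 hashes plus a free-form dataset-level description.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(data_dir: Path, description: dict) -> dict:
    """Combine dataset-level metadata with per-file size and hash (hypothetical layout)."""
    items = {}
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(data_dir).as_posix()
        items[rel] = {
            "size_in_bytes": path.stat().st_size,
            "sha256": file_digest(path),
        }
    return {"description": description, "items": items}


def verify(data_dir: Path, manifest: dict) -> list:
    """Return a list of problems found; an empty list means the data matches the manifest."""
    problems = []
    on_disk = {
        p.relative_to(data_dir).as_posix()
        for p in data_dir.rglob("*")
        if p.is_file()
    }
    for rel, meta in manifest["items"].items():
        if rel not in on_disk:
            problems.append(f"missing: {rel}")
        elif file_digest(data_dir / rel) != meta["sha256"]:
            problems.append(f"altered: {rel}")
    for rel in sorted(on_disk - set(manifest["items"])):
        problems.append(f"unexpected: {rel}")
    return problems


def demo() -> tuple:
    """Build a tiny dataset, verify it, change a file, and verify again."""
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "data"
        data_dir.mkdir()
        (data_dir / "reads.txt").write_text("ACGT\n")
        manifest = build_manifest(data_dir, {"owner": "example"})
        before = verify(data_dir, manifest)
        # Simulate accidental modification of a data file.
        (data_dir / "reads.txt").write_text("ACGA\n")
        after = verify(data_dir, manifest)
        return before, after
```

Calling `demo()` returns one problem list from before the file is changed and one from after. Keeping the manifest separate from the data files is a design choice of this sketch, made so that the manifest itself is not counted as an unexpected item.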