Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich, Jülich, Germany.
Laboratory of Brain Imaging, Nencki Institute of Experimental Biology, Polish Academy of Sciences, Warsaw, Poland.
Sci Data. 2022 Mar 11;9(1):80. doi: 10.1038/s41597-022-01163-2.
Large-scale datasets present unique opportunities to perform scientific investigations with unprecedented breadth. However, they also pose considerable challenges for the findability, accessibility, interoperability, and reusability (FAIR) of research outcomes due to infrastructure limitations, data usage constraints, or software license restrictions. Here we introduce a DataLad-based, domain-agnostic framework suitable for reproducible data processing in compliance with open science mandates. The framework attempts to minimize platform idiosyncrasies and performance-related complexities. It enables the capture of machine-actionable computational provenance records that can be used to retrace and verify the origins of research outcomes, and that can be re-executed independently of the original computing infrastructure. We demonstrate the framework's performance using two showcases: one highlighting data sharing and transparency (using the studyforrest.org dataset) and another highlighting scalability (using the largest public brain imaging dataset available: the UK Biobank dataset).
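To make the idea of a machine-actionable provenance record more concrete, the following is a minimal conceptual sketch in plain Python. It is not DataLad's implementation or record format: the function names (`run_with_provenance`, `rerun_and_verify`), the record fields (`cmd`, `inputs`, `outputs`), and the toy uppercase command are all hypothetical and chosen only for illustration. The sketch assumes a deterministic command and uses a second temporary directory as a stand-in for a different computing infrastructure.

```python
import hashlib
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_with_provenance(cmd, inputs, outputs, cwd):
    """Run cmd inside cwd and return a JSON-serializable record (hypothetical format)."""
    cwd = Path(cwd)
    # Hash the inputs before execution so the record describes what was consumed.
    input_hashes = {p: sha256(cwd / p) for p in inputs}
    subprocess.run(cmd, cwd=cwd, check=True)
    return {
        "cmd": cmd,
        "inputs": input_hashes,
        "outputs": {p: sha256(cwd / p) for p in outputs},
    }


def rerun_and_verify(record, cwd):
    """Re-execute a record in cwd; True only if inputs and outputs match the record."""
    cwd = Path(cwd)
    # Refuse to re-execute if the inputs differ from the recorded ones.
    for path, digest in record["inputs"].items():
        if sha256(cwd / path) != digest:
            return False
    subprocess.run(record["cmd"], cwd=cwd, check=True)
    return all(sha256(cwd / p) == d for p, d in record["outputs"].items())


def demo():
    # Toy deterministic "analysis": write an uppercased copy of the input file.
    script = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
    # Note: sys.executable ties the record to this interpreter path; a real
    # record would need a portable description of the software environment.
    cmd = [sys.executable, "-c", script, "in.txt", "out.txt"]
    with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
        (Path(first) / "in.txt").write_text("hello\n")
        record = run_with_provenance(cmd, ["in.txt"], ["out.txt"], first)
        # Round-trip through JSON to show the record is plain, machine-readable data.
        record = json.loads(json.dumps(record))
        # Re-execute in a separate location with the same input content.
        (Path(second) / "in.txt").write_text("hello\n")
        verified = rerun_and_verify(record, second)
        # A modified input is detected before anything is re-executed.
        (Path(second) / "in.txt").write_text("tampered\n")
        tampered = rerun_and_verify(record, second)
    return verified, tampered


if __name__ == "__main__":
    print(demo())
```

The sketch identifies files by content hash rather than by location, which is what lets the record be checked and re-executed away from where it was produced. A real system would additionally need to record the software environment and version the data itself.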