Zhang Chao, Bijlard Jochem, Staiger Christine, Scollen Serena, van Enckevort David, Hoogstrate Youri, Senf Alexander, Hiltemann Saskia, Repo Susanna, Pipping Wibo, Bierkens Mariska, Payralbe Stefan, Stringer Bas, Heringa Jaap, Stubbs Andrew, Bonino Da Silva Santos Luiz Olavo, Belien Jeroen, Weistra Ward, Azevedo Rita, van Bochove Kees, Meijer Gerrit, Boiten Jan-Willem, Rambla Jordi, Fijneman Remond, Spalding J Dylan, Abeln Sanne
Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, 1081 HV, Netherlands.
The Hyve, Utrecht, 3511 MJ, Netherlands.
F1000Res. 2017 Aug 16;6. doi: 10.12688/f1000research.12168.1. eCollection 2017.
The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data.
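The identifier levels named in the conclusion (study, data access committee, physical sample, data sample, raw data file) can be pictured as a small linked record structure. The following is only a minimal sketch under stated assumptions, not the data model used by EGA or tranSMART: all class names, field names, the helper `raw_files_for_cohort` and the identifier strings are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class RawDataFile:
    file_id: str  # hypothetical persistent identifier of one raw data file


@dataclass(frozen=True)
class DataSample:
    data_sample_id: str      # hypothetical identifier of one profiling run on a sample
    physical_sample_id: str  # hypothetical identifier of the physical sample it came from
    raw_files: Tuple[RawDataFile, ...]


@dataclass(frozen=True)
class Study:
    study_id: str  # hypothetical study-level identifier
    dac_id: str    # hypothetical identifier of the data access committee
    data_samples: Tuple[DataSample, ...]


def raw_files_for_cohort(study: Study, physical_sample_ids: Iterable[str]) -> List[str]:
    """Trace a cohort (a set of physical sample identifiers) back to raw file identifiers."""
    wanted = set(physical_sample_ids)
    return [
        raw_file.file_id
        for data_sample in study.data_samples
        if data_sample.physical_sample_id in wanted
        for raw_file in data_sample.raw_files
    ]


# Placeholder identifiers only; these are not real accessions.
study = Study(
    study_id="STUDY-0001",
    dac_id="DAC-0001",
    data_samples=(
        DataSample("DS-A-1", "PS-A", (RawDataFile("FILE-1"), RawDataFile("FILE-2"))),
        DataSample("DS-B-1", "PS-B", (RawDataFile("FILE-3"),)),
    ),
)

selected = raw_files_for_cohort(study, ["PS-A"])
print(selected)  # ['FILE-1', 'FILE-2']
```

In this sketch only identifiers cross the boundary between the two systems: a cohort selected on the interpreted-data side is represented by physical sample identifiers, and those are enough to reach the raw data files without duplicating other metadata.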