DataCurator.jl: efficient, portable and reproducible validation, curation and transformation of large heterogeneous datasets using human-readable recipes compiled into machine-verifiable templates.

Authors

Cardoen Ben, Ben Yedder Hanene, Lee Sieun, Nabi Ivan Robert, Hamarneh Ghassan

Affiliations

Department of Computing Science, Simon Fraser University, 8888 University Dr W, Burnaby, British Columbia V5A1S6, Canada.

Precision Imaging Beacon, University of Nottingham, Nottingham NG7 2RD, UK.

Publication information

Bioinform Adv. 2023 Jun 1;3(1):vbad068. doi: 10.1093/bioadv/vbad068. eCollection 2023.

DOI:10.1093/bioadv/vbad068

PMID:37359728

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10290225/

Abstract

Large-scale processing of heterogeneous datasets in interdisciplinary research often requires time-consuming manual data curation. Ambiguity in the data layout and preprocessing conventions can easily compromise reproducibility and scientific discovery, and even when detected, it requires time and effort to be corrected by domain experts. Poor data curation can also interrupt processing jobs on large computing clusters, causing frustration and delays. We introduce DataCurator, a portable software package that verifies arbitrarily complex datasets of mixed formats, working as well on clusters as on local systems. Human-readable TOML recipes are converted into executable, machine-verifiable templates, enabling users to easily verify datasets using custom rules without writing code. Recipes can be used to transform and validate data, for pre- or post-processing, selection of data subsets, sampling and aggregation, such as summary statistics. Processing pipelines no longer need to be burdened by laborious data validation, with data curation and validation replaced by human- and machine-verifiable recipes specifying rules and actions. Multithreaded execution ensures scalability on clusters, and existing Julia, R and Python libraries can be reused. DataCurator enables efficient remote workflows, offering integration with Slack and the ability to transfer curated data to clusters using OwnCloud and SCP. Code available at: https://github.com/bencardoen/DataCurator.jl.
