Wang Yichen, Sarfraz Irzam, Teh Wei Kheng, Sokolov Artem, Herb Brian R, Creasy Heather H, Virshup Isaac, Dries Ruben, Degatano Kylee, Mahurkar Anup, Schnell Daniel J, Madrigal Pedro, Hilton Jason, Gehlenborg Nils, Tickle Timothy, Campbell Joshua D
Department of Medicine, Boston University School of Medicine, Boston, MA, USA.
European Bioinformatics Institute, European Molecular Biology Laboratory, Hinxton, Cambridgeshire, UK.
bioRxiv. 2023 Mar 7:2023.03.06.531314. doi: 10.1101/2023.03.06.531314.
A large number of genomic and imaging datasets are being produced by consortia that seek to characterize healthy and diseased tissues at single-cell resolution. While much effort has been devoted to capturing information about biospecimens and experimental procedures, metadata standards that describe data matrices and the analysis workflows that produced them are relatively lacking. Detailed metadata schemas related to data analysis are needed to facilitate sharing and interoperability across groups and to promote data provenance for reproducibility. To address this need, we developed the Matrix and Analysis Metadata Standards (MAMS) to serve as a resource for data coordinating centers and tool developers. We first curated several simple and complex "use cases" to characterize the types of feature-observation matrices (FOMs), annotations, and analysis metadata produced in different workflows. Based on these use cases, we defined metadata fields to describe the data contained within each matrix, including fields related to processing, modality, and subsets. Suggested terms were created for the majority of fields to aid in harmonizing metadata terms across groups. Additional provenance metadata fields were defined to describe the software and workflows that produced each FOM. Finally, we developed a simple list-like schema that can be used to store MAMS information and can be implemented in multiple formats. Overall, MAMS can be used as a guide to harmonize analysis-related metadata, which will ultimately facilitate integration of datasets across tools and consortia. MAMS specifications, use cases, and examples can be found at https://github.com/single-cell-mams/mams/.
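To make the idea of a list-like schema concrete, the following is a minimal sketch of how per-matrix metadata and provenance could be held in a nested structure and serialized to one format (JSON). It is not the MAMS specification: the class names, field names (`processing`, `modality`, `obs_subset`, `feature_subset`, `parent_id`), the example values, and the tool name `example-tool` are all illustrative assumptions based only on the categories named in the abstract. The authoritative field definitions are in the repository linked above.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Provenance:
    """Software step that produced a matrix (hypothetical field names)."""
    tool: str
    version: str
    function: str


@dataclass
class FOM:
    """Metadata for one feature-observation matrix (hypothetical field names)."""
    id: str
    processing: str      # how the values were processed, e.g. "raw counts"
    modality: str        # what was measured, e.g. "rna"
    obs_subset: str      # which observations (cells) are included
    feature_subset: str  # which features (genes) are included
    parent_id: Optional[str] = None            # FOM this one was derived from
    provenance: Optional[Provenance] = None    # step that produced this FOM


def to_list_schema(foms):
    """Store FOM records in a plain nested dict keyed by FOM id."""
    return {"FOM": {f.id: asdict(f) for f in foms}}


# Two matrices from an assumed workflow: raw counts, then a normalized subset.
raw = FOM("fom1", "raw counts", "rna", "full", "full")
norm = FOM(
    "fom2", "log-normalized", "rna", "filtered", "variable",
    parent_id="fom1",
    provenance=Provenance("example-tool", "0.0.0", "normalize"),
)

mams = to_list_schema([raw, norm])
as_json = json.dumps(mams, indent=2)  # one possible on-disk representation
restored = json.loads(as_json)        # reading it back yields the same structure
```

Because the structure is only nested dictionaries of strings, the same records could in principle be written to other list-capable containers as well; JSON is used here only because it needs no third-party packages.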