Raising orphans from a metadata morass: A researcher's guide to re-use of public 'omics data.

Affiliations

Dept. of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50010, USA; Center for Metabolic Biology, Iowa State University, Ames, IA 50011, USA.

Genome Informatics Facility, Office of Biotechnology, Iowa State University, Ames, IA 50011, USA.

Publication information

Plant Sci. 2018 Feb;267:32-47. doi: 10.1016/j.plantsci.2017.10.014. Epub 2017 Nov 7.

DOI:10.1016/j.plantsci.2017.10.014

PMID:29362097

Abstract

More than 15 petabases of raw RNAseq data is now accessible through public repositories. Acquisition of other 'omics data types is expanding, though most lack a centralized archival repository. Data-reuse provides tremendous opportunity to extract new knowledge from existing experiments, and offers a unique opportunity for robust, multi-'omics analyses by merging metadata (information about experimental design, biological samples, protocols) and data from multiple experiments. We illustrate how predictive research can be accelerated by meta-analysis with a study of orphan (species-specific) genes. Computational predictions are critical to infer orphan function because their coding sequences provide very few clues. The metadata in public databases is often confusing; a test case with Zea mays mRNA seq data reveals a high proportion of missing, misleading or incomplete metadata. This metadata morass significantly diminishes the insight that can be extracted from these data. We provide tips for data submitters and users, including specific recommendations to improve metadata quality by more use of controlled vocabulary and by metadata reviews. Finally, we advocate for a unified, straightforward metadata submission and retrieval system.
