Huang Yu-Ning, Jaiswal Pooja Vinod, Rajes Anushka, Yadav Anushka, Yu Dottie, Liu Fangyun, Scheg Grace, Shih Emma, Boldirev Grigore, Nakashidze Irina, Sarkar Aditya, Mehta Jay Himanshu, Wang Ke, Patel Khooshbu Kantibhai, Mirza Mustafa Ali Baig, Hapani Kunali Chetan, Peng Qiushi, Ayyala Ram, Guo Ruiwei, Kapur Shaunak, Ramesh Tejasvene, Ciorbă Dumitru, Munteanu Viorel, Bostan Viorel, Dimian Mihai, Abedalthagafi Malak S, Mangul Serghei
Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, 90089, USA.
Department of Pharmacology and Pharmaceutical Sciences, Alfred E. Mann School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA.
Genome Biol. 2025 Sep 9;26(1):274. doi: 10.1186/s13059-025-03725-0.
Recent advances in high-throughput sequencing technologies have enabled the collection and sharing of a massive amount of omics data, along with its associated metadata: descriptive information that contextualizes the data, including phenotypic traits and experimental design. Enhancing metadata availability is critical to ensure data reusability and reproducibility and to facilitate novel biomedical discoveries through effective data reuse. Yet, incomplete metadata accompanying public omics data may hinder reproducibility and reusability and limit secondary analyses.
Our study assesses the completeness of metadata in over 253 scientific studies, covering more than 164,000 samples from both human and non-human mammalian studies. We find that over 25% of critical metadata are omitted, with only 74.8% of relevant phenotypes available in publications or public repositories. Notably, public repositories alone contain 62% of the phenotypes, surpassing the textual content of publications by 3.5%. Only 11.5% of studies shared all phenotypes, while 37.9% shared less than 40% of the phenotypes. Additionally, studies with non-human samples are more likely to include complete metadata than human studies. Similar trends are observed in an extended dataset comprising 61,000 studies and 2.1 million samples from the Gene Expression Omnibus (GEO) data repository.
These findings highlight significant gaps in metadata sharing, underscoring the need for standardized practices to improve metadata availability. Enhanced metadata reporting would foster data reusability, support better-informed decision-making, and promote reproducible research across the biomedical field.
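To make the reported quantities concrete, the following is a minimal Python sketch of how per-study phenotype completeness and the two summary shares (studies sharing all phenotypes, studies sharing less than 40%) could be computed. It is not the authors' pipeline: the function names, the phenotype labels, the toy data, and the choice to average per-study scores are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical representation: for each study, a map from a relevant
# phenotype name to whether it was shared (in the paper or a repository).
StudyPhenotypes = Dict[str, bool]


def completeness(shared: StudyPhenotypes) -> float:
    """Fraction of a study's relevant phenotypes that were shared."""
    if not shared:
        raise ValueError("a study needs at least one relevant phenotype")
    return sum(shared.values()) / len(shared)


def summarize(studies: Dict[str, StudyPhenotypes],
              low_threshold: float = 0.4) -> Dict[str, float]:
    """Summarize completeness across studies.

    Assumption: overall completeness is the unweighted mean of the
    per-study scores; the source does not state how it was averaged.
    """
    scores = [completeness(p) for p in studies.values()]
    n = len(scores)
    return {
        "mean_completeness": sum(scores) / n,
        "fully_complete_share": sum(s == 1.0 for s in scores) / n,
        "below_threshold_share": sum(s < low_threshold for s in scores) / n,
    }


# Toy data, invented for illustration only.
toy_studies = {
    "study_A": {"age": True, "sex": True, "tissue": True, "disease": True},
    "study_B": {"age": False, "sex": True, "tissue": True, "disease": False},
    "study_C": {"age": False, "sex": False, "tissue": True, "disease": False},
}

summary = summarize(toy_studies)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

In this toy example one of three studies shares every phenotype and one falls below the 40% threshold, mirroring the kind of split the abstract reports at much larger scale.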