Huang Yu-Ning, Jaiswal Pooja Vinod, Rajesh Anushka, Yadav Anushka, Yu Dottie, Liu Fangyun, Scheg Grace, Shih Emma, Boldirev Grigore, Nakashidze Irina, Sarkar Aditya, Mehta Jay Himanshu, Wang Ke, Patel Khooshbu Kantibhai, Mirza Mustafa Ali Baig, Hapani Kunali Chetan, Peng Qiushi, Ayyala Ram, Guo Ruiwei, Kapur Shaunak, Ramesh Tejasvene, Ciorbă Dumitru, Munteanu Viorel, Bostan Viorel, Dimian Mihai, Abedalthagafi Malak S, Mangul Serghei
Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, California, 90089, USA.
bioRxiv. 2025 Jul 7:2021.11.22.469640. doi: 10.1101/2021.11.22.469640.
Recent advances in high-throughput sequencing technologies have made it possible to collect and share massive amounts of omics data along with their associated metadata. Enhancing metadata availability is critical to ensure data reusability and reproducibility and to facilitate novel biomedical discoveries through effective data reuse. Yet incomplete metadata accompanying public omics data may hinder reproducibility and reusability by reducing sample interpretability and limiting secondary analyses. In this study, we performed a comprehensive assessment of the completeness of metadata shared in scientific publications and/or public repositories by analyzing over 253 studies encompassing over 164 thousand samples, including both human and non-human mammalian studies. We observed that studies often omit over a quarter of important phenotypes, with an average of only 74.8% of them shared in either the text of the publication or the corresponding repository. Notably, public repositories alone contained 62% of the metadata, surpassing the textual content of publications by 3.5%. Only 11.5% of studies shared all phenotypes, while 37.9% shared less than 40% of the phenotypes. Studies involving non-human samples were more likely to share metadata than studies involving human samples. We observed similar results on an extended dataset spanning 2.1 million samples across over 61,000 studies from the Gene Expression Omnibus repository. The limited availability of metadata reported in our study emphasizes the need for improved metadata sharing practices and standardized reporting. Finally, we discuss the numerous benefits that improving the availability and quality of metadata offers to the scientific community and beyond, including support for data-driven decision-making and policy development in biomedical research. This work provides a scalable framework for evaluating metadata availability and may help guide future policy and infrastructure development.
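As an illustration of the completeness figures quoted above, the following minimal sketch computes, for each study, the percentage of expected phenotypes shared in the publication text, in the repository, and in either source, and then averages the combined score across studies. The phenotype list, the function names, and the example studies are hypothetical and chosen only for illustration; the abstract does not specify the study's actual phenotype list or scoring code.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical phenotype list (an assumption, not the study's actual list).
PHENOTYPES = ("age", "sex", "tissue", "disease_status")


def completeness(publication: Set[str],
                 repository: Set[str],
                 phenotypes: Iterable[str] = PHENOTYPES) -> Dict[str, float]:
    """Percentage of expected phenotypes shared per source and overall."""
    wanted = set(phenotypes)
    total = len(wanted)
    pub = publication & wanted    # ignore fields outside the expected list
    repo = repository & wanted
    return {
        "publication": 100 * len(pub) / total,
        "repository": 100 * len(repo) / total,
        # A phenotype counts as shared if it appears in either source.
        "overall": 100 * len(pub | repo) / total,
    }


def mean_overall(studies: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Average of the per-study overall completeness scores."""
    scores = [completeness(pub, repo)["overall"] for pub, repo in studies]
    return sum(scores) / len(scores)


# Two made-up studies: (phenotypes in publication, phenotypes in repository).
example_studies = [
    ({"age", "sex"}, {"sex", "tissue"}),
    ({"age"}, set()),
]
first = completeness(*example_studies[0])
average = mean_overall(example_studies)
```

Under these assumptions, the first example study shares half of the phenotypes in each source but three quarters overall, because the two sources overlap only partly. This is the sense in which a combined score can exceed either single-source score.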