The variable quality of metadata about biological samples used in biomedical experiments.

Affiliation

Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA.

Publication information

Sci Data. 2019 Feb 19;6:190021. doi: 10.1038/sdata.2019.21.

DOI:10.1038/sdata.2019.21

PMID:30778255

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6380228/

Abstract

We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
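The kinds of checks the abstract describes (whether binary and numeric fields hold well-typed values, and whether several field names express the same aspect of a sample) can be illustrated with a minimal Python sketch. This is not the authors' method or code: the accepted boolean tokens, the function names, and the sample field names are all hypothetical, and plain name normalization stands in for the clustering the paper performed.

```python
import re
from collections import defaultdict

# Hypothetical set of accepted boolean tokens (an assumption for illustration;
# neither repository's actual specification is reproduced here).
BOOLEAN_TOKENS = {"true", "false", "yes", "no"}

# Plain decimal number with optional sign and exponent; rejects units and text.
NUMBER_PATTERN = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def is_valid_boolean(value: str) -> bool:
    """Return True if the value is one of the accepted boolean tokens."""
    return value.strip().lower() in BOOLEAN_TOKENS


def is_valid_number(value: str) -> bool:
    """Return True if the whole value is a plain number, with no unit or text."""
    return NUMBER_PATTERN.fullmatch(value.strip()) is not None


def normalize_field_name(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def group_field_names(names):
    """Group field names that collapse to the same normalized key.

    Only groups with more than one distinct spelling are returned.
    """
    groups = defaultdict(set)
    for name in names:
        groups[normalize_field_name(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


if __name__ == "__main__":
    # Hypothetical field values and names, invented for this example.
    for value in ["Yes", "not collected", "37.5", "37 years"]:
        print(repr(value), is_valid_boolean(value), is_valid_number(value))
    print(group_field_names(["Collection_Date", "collection date", "collectionDate", "tissue"]))
```

Exact-match normalization only catches spelling and separator variants. Names that differ in wording would need a similarity-based approach, which this sketch does not attempt.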

Figure image: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fbc/6380228/7546ef891a19/sdata201921-f1.jpg

Similar articles

BioSamples database: an updated sample metadata hub.

Nucleic Acids Res. 2019 Jan 8;47(D1):D1172-D1178. doi: 10.1093/nar/gky1061.

Using association rule mining and ontologies to generate metadata recommendations from multiple biomedical databases.

Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz059.

Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata.

BMC Bioinformatics. 2017 Sep 18;18(1):415. doi: 10.1186/s12859-017-1832-4.

CEDAR OnDemand: a browser extension to generate ontology-based scientific metadata.

BMC Bioinformatics. 2018 Jul 16;19(1):268. doi: 10.1186/s12859-018-2247-6.

Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations.

AMIA Annu Symp Proc. 2018 Apr 16;2017:1272-1281. eCollection 2017.

BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata.

Nucleic Acids Res. 2012 Jan;40(Database issue):D57-63. doi: 10.1093/nar/gkr1163. Epub 2011 Dec 1.

The Genomic Observatories Metadatabase (GeOMe): A new repository for field and sampling event metadata associated with genetic samples.

PLoS Biol. 2017 Aug 3;15(8):e2002925. doi: 10.1371/journal.pbio.2002925. eCollection 2017 Aug.

A digital repository with an extensible data model for biobanking and genomic analysis management.

BMC Genomics. 2014;15 Suppl 3(Suppl 3):S3. doi: 10.1186/1471-2164-15-S3-S3. Epub 2014 May 6.

The CAIRR Pipeline for Submitting Standards-Compliant B and T Cell Receptor Repertoire Sequencing Studies to the National Center for Biotechnology Information Repositories.

Front Immunol. 2018 Aug 16;9:1877. doi: 10.3389/fimmu.2018.01877. eCollection 2018.

Cited by

The systematic assessment of completeness of public metadata accompanying omics studies in the Gene Expression Omnibus data repository.

Genome Biol. 2025 Sep 9;26(1):274. doi: 10.1186/s13059-025-03725-0.

Toward Sex-Specific Biomaterials Innovation: A Perspective.

ACS Biomater Sci Eng. 2025 Sep 8;11(9):5131-5144. doi: 10.1021/acsbiomaterials.5c00342. Epub 2025 Aug 20.

Evaluation of DNA barcoding reference databases for marine species in the western and central Pacific Ocean.

PeerJ. 2025 Jul 14;13:e19674. doi: 10.7717/peerj.19674. eCollection 2025.

The systematic assessment of completeness of public metadata accompanying omics studies in the Gene Expression Omnibus.

bioRxiv. 2025 Jul 7:2021.11.22.469640. doi: 10.1101/2021.11.22.469640.

Gain efficiency with streamlined and automated data processing: Examples from high-throughput monoclonal antibody production.

PLoS One. 2025 Jul 1;20(7):e0326678. doi: 10.1371/journal.pone.0326678. eCollection 2025.

Extraction of biological terms using large language models enhances the usability of metadata in the BioSample database.

Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf070.

Structured Knowledge Base Enhances Effective Use of Large Language Models for Metadata Curation.

AMIA Annu Symp Proc. 2025 May 22;2024:1050-1058. eCollection 2024.

Perceptual and technical barriers in sharing and formatting metadata accompanying omics studies.

Cell Genom. 2025 May 14;5(5):100845. doi: 10.1016/j.xgen.2025.100845. Epub 2025 Apr 10.

aurora: a machine learning GWAS tool for analyzing microbial habitat adaptation.

Genome Biol. 2025 Mar 23;26(1):66. doi: 10.1186/s13059-025-03524-7.

Machine learning reveals the dynamic importance of accessory sequences for outbreak clustering.

mBio. 2025 Mar 12;16(3):e0265024. doi: 10.1128/mbio.02650-24. Epub 2025 Jan 28.
