Increasing metadata coverage of SRA BioSample entries using deep learning-based named entity recognition.

Affiliations

Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA.

Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA.

Publication information

Database (Oxford). 2021 Apr 29;2021. doi: 10.1093/database/baab021.

DOI:10.1093/database/baab021

PMID:33914028

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8083811/

Abstract

High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE.
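The step of applying a short-phrase classifier to a longer TITLE can be illustrated with a minimal sketch. The abstract does not say how the TITLE is scanned, so the n-gram windowing, the confidence threshold, and every name below (`toy_classifier`, `TOY_LEXICON`, `predict_from_title`) are assumptions made for illustration. The lookup table is a hypothetical stand-in for the trained recurrent network, not the authors' model.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical stand-in for the trained network: maps a short phrase to
# (category, confidence). The model in the abstract is a recurrent neural
# network trained on BioSample attribute-value pairs; this table is a toy.
TOY_LEXICON: Dict[str, Tuple[str, float]] = {
    "homo sapiens": ("Genus/Species", 0.99),
    "breast cancer": ("Condition/Disease", 0.95),
    "c57bl/6": ("Strain", 0.90),
}


def toy_classifier(phrase: str) -> Tuple[str, float]:
    """Return (category, confidence); unknown phrases get zero confidence."""
    return TOY_LEXICON.get(phrase.lower(), ("Other", 0.0))


def ngrams(tokens: List[str], max_n: int) -> Iterator[str]:
    """Yield every contiguous phrase of 1..max_n tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def predict_from_title(
    title: str,
    classify: Callable[[str], Tuple[str, float]] = toy_classifier,
    max_n: int = 3,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Dict[str, Tuple[str, float]]:
    """Keep, per category, the highest-scoring phrase above the threshold."""
    best: Dict[str, Tuple[str, float]] = {}
    for phrase in ngrams(title.split(), max_n):
        category, score = classify(phrase)
        if score >= threshold and score > best.get(category, ("", 0.0))[1]:
            best[category] = (phrase, score)
    return best


result = predict_from_title("RNA-seq of Homo sapiens breast cancer biopsy")
for category, (phrase, score) in result.items():
    print(f"{category}: {phrase} ({score:.2f})")
```

Categories with no phrase above the threshold are simply absent from the result, which mirrors the "lack of predictions" for some categories that the abstract reports.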
