A machine learning-enabled open biodata resource inventory from the scientific literature.

Author information

Global Biodata Coalition, Strasbourg, France.

University Library, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America.

Publication information

PLoS One. 2023 Nov 28;18(11):e0294812. doi: 10.1371/journal.pone.0294812. eCollection 2023.

DOI:10.1371/journal.pone.0294812

PMID:38015968

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10684096/

Abstract

Modern biological research depends on data resources. These resources archive difficult-to-reproduce data and provide added-value aggregation, curation, and analyses. Collectively, they constitute a global infrastructure of biodata resources. While the organic proliferation of biodata resources has enabled incredible research, sustained support for the individual resources that make up this distributed infrastructure is a challenge. The Global Biodata Coalition (GBC) was established by research funders in part to aid in developing sustainable funding strategies for biodata resources. An important component of this work is understanding the scope of the resource infrastructure: how many biodata resources there are, where they are, and how they are supported. Existing registries require self-registration and/or extensive curation, and we sought to develop a method for assembling a global inventory of biodata resources that could be periodically updated with minimal human intervention. The approach we developed identifies biodata resources using open data from the scientific literature. Specifically, we used a machine learning-enabled natural language processing approach to identify biodata resources from titles and abstracts of life sciences publications contained in Europe PMC. Pretrained BERT (Bidirectional Encoder Representations from Transformers) models were fine-tuned to classify publications as describing a biodata resource or not and to predict the resource name using named entity recognition. To improve the quality of the resulting inventory, low-confidence predictions and potential duplicates were manually reviewed. Further information about the resources, such as funder and geolocation information, was then obtained from article metadata. These efforts yielded an inventory of 3112 unique biodata resources based on articles published from 2011 to 2021. The code was developed to facilitate reuse and includes automated pipelines. All products of this effort are released under permissive licensing, including the biodata resource inventory itself (CC0) and all associated code (BSD/MIT).
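
The review step described in the abstract (routing low-confidence predictions and potential duplicates to a human) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Prediction` record, the `REVIEW_THRESHOLD` value, the name-normalization rule, and the example resource names are all assumptions made here for illustration, and the confidence scores stand in for whatever a fine-tuned classifier or NER model would emit.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Prediction:
    """One predicted resource mention (hypothetical record layout)."""
    article_id: str      # identifier of the source article
    resource_name: str   # name proposed by a named entity recognition model
    confidence: float    # model score in [0, 1]


# Hypothetical cutoff; the abstract does not state the value that was used.
REVIEW_THRESHOLD = 0.9


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumerics so spelling variants collide."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def triage(predictions, threshold=REVIEW_THRESHOLD):
    """Split predictions into auto-accepted and needs-manual-review lists."""
    accepted = [p for p in predictions if p.confidence >= threshold]
    review = [p for p in predictions if p.confidence < threshold]
    return accepted, review


def potential_duplicates(predictions):
    """Group predictions whose normalized names match; keep groups of 2+."""
    groups = defaultdict(list)
    for p in predictions:
        groups[normalize(p.resource_name)].append(p)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Made-up example data, for illustration only.
predictions = [
    Prediction("A1", "ExampleDB", 0.98),
    Prediction("A2", "example-db", 0.95),
    Prediction("A3", "OtherBase", 0.55),
]
accepted, review = triage(predictions)
duplicates = potential_duplicates(predictions)
print(len(accepted), "accepted;", len(review), "flagged for review")
print("possible duplicate groups:", sorted(duplicates))
```

Exact matching on a normalized name is a deliberately simple duplicate heuristic; it only flags candidates, leaving the merge decision to a reviewer, which matches the manual review the abstract describes.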

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6cb5/10684096/e622f2769bf1/pone.0294812.g001.jpg
