Global Biodata Coalition, Strasbourg, France.
University Library, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America.
PLoS One. 2023 Nov 28;18(11):e0294812. doi: 10.1371/journal.pone.0294812. eCollection 2023.
Modern biological research depends on data resources. These resources archive difficult-to-reproduce data and provide added-value aggregation, curation, and analyses. Collectively, they constitute a global infrastructure of biodata resources. While the organic proliferation of biodata resources has enabled remarkable research, sustaining support for the individual resources that make up this distributed infrastructure is a challenge. The Global Biodata Coalition (GBC) was established by research funders in part to aid in developing sustainable funding strategies for biodata resources. An important component of this work is understanding the scope of the resource infrastructure: how many biodata resources there are, where they are, and how they are supported. Existing registries require self-registration and/or extensive curation, so we sought to develop a method for assembling a global inventory of biodata resources that could be periodically updated with minimal human intervention. The approach we developed identifies biodata resources using open data from the scientific literature. Specifically, we used a machine learning-enabled natural language processing approach to identify biodata resources from titles and abstracts of life sciences publications contained in Europe PMC. Pretrained BERT (Bidirectional Encoder Representations from Transformers) models were fine-tuned to classify publications as describing a biodata resource or not and to predict the resource name using named entity recognition. To improve the quality of the resulting inventory, low-confidence predictions and potential duplicates were manually reviewed. Further information about the resources, such as funder and geolocation information, was then obtained from article metadata. These efforts yielded an inventory of 3112 unique biodata resources based on articles published from 2011 to 2021. The code was developed to facilitate reuse and includes automated pipelines. All products of this effort are released under permissive licensing, including the biodata resource inventory itself (CC0) and all associated code (BSD/MIT).
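The manual-review step described in the abstract (flagging low-confidence predictions and potential duplicates) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Prediction` record, the 0.9 confidence threshold, the name-normalization rule, and the sample resource names are all hypothetical and chosen only to show the idea.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Prediction:
    article_id: str    # hypothetical identifier of the source article
    name: str          # resource name predicted by the NER step
    confidence: float  # model probability that the article describes a resource


def normalize(name: str) -> str:
    """Lowercase and keep only letters and digits so trivial variants collide."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def triage(predictions, threshold=0.9):
    """Split predictions into auto-accepted and needs-review lists.

    The default threshold is an illustrative assumption, not a value from the paper.
    """
    accepted = [p for p in predictions if p.confidence >= threshold]
    review = [p for p in predictions if p.confidence < threshold]
    return accepted, review


def potential_duplicates(predictions):
    """Group predictions by normalized name; keep groups with more than one entry."""
    groups = defaultdict(list)
    for p in predictions:
        groups[normalize(p.name)].append(p)
    return {key: grp for key, grp in groups.items() if len(grp) > 1}


# Made-up example records, not entries from the published inventory.
preds = [
    Prediction("A1", "ExampleDB", 0.98),
    Prediction("A2", "Example-DB", 0.95),
    Prediction("A3", "ToyBase", 0.55),
]
accepted, review = triage(preds)
duplicates = potential_duplicates(preds)
```

Here `review` holds the one low-confidence record and `duplicates` groups the two spellings of the same hypothetical resource, so both can be routed to a human curator instead of being merged or dropped automatically.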