Unlocking the potential of PubMed Central supplementary data files.

Authors

Gobeill Julien, Caucheteur Déborah, Flament Alexandre, Michel Pierre-André, Mottaz Anaïs, Pasche Emilie, Ruch Patrick

Affiliations

SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva 1206, Switzerland.

BiTeM Group, Information Sciences, HES-SO/HEG Geneva, Carouge 1227, Switzerland.

Publication information

Bioinform Adv. 2025 Jun 27;5(1):vbaf155. doi: 10.1093/bioadv/vbaf155. eCollection 2025.

DOI:10.1093/bioadv/vbaf155

PMID:40861394

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12371329/

Abstract

MOTIVATION

Biocuration workflows often rely on comprehensive literature searches for specific biological entities. However, standard search engines such as MEDLINE and PubMed Central provide an incomplete picture of the scientific literature because they do not index the increasing amount of valuable information published in supplementary data files. Over two years, we addressed this gap by systematically extracting text from a large proportion (85%) of these files, resulting in 35 million searchable documents. To assess the information gain provided by supplementary data files beyond the manuscripts, we searched both for mentions of dozens of Global Core Biodata Resources (GCBRs), which are fundamental biological databases essential for the life sciences. We searched for mentions of GCBR names and accession numbers, which uniquely identify biological entities within these resources.
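To make the accession-number search concrete, here is a minimal sketch of how identifiers for one resource might be matched in text extracted from a supplementary file. It is not the authors' pipeline. It assumes UniProtKB as the example resource and uses the accession pattern published in the UniProt documentation; the function name `find_accessions` and the sample sentence are hypothetical.

```python
import re

# UniProtKB accession pattern as documented by UniProt. It is used here only as
# an illustrative example of one resource's identifier format (assumption: the
# paper's own matching rules are not described in this abstract).
UNIPROT_ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)


def find_accessions(text: str) -> list[str]:
    """Return distinct UniProtKB-style accessions in order of first appearance."""
    seen: dict[str, None] = {}
    for match in UNIPROT_ACCESSION.finditer(text):
        seen.setdefault(match.group(0), None)
    return list(seen)


# Hypothetical sample text standing in for content extracted from a file.
sample = "Table S3 lists P12345 and Q9H0H5; P12345 is also shown with A0A024R161."
print(find_accessions(sample))
```

A purely syntactic pattern like this one can also match strings that are not real accessions, so a production search would likely need validation against the resource itself.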

RESULTS

The recall gain from using the supplementary data files to search for articles mentioning resource names is 6%. In addition, 97% of all accession numbers identified were published in the supplementary data files, highlighting their increasing importance for highly specific topics or curation pipelines. We show that the number of accession numbers published in the supplementary data files is increasing year on year, but that 87% of these are published in Excel files. This format facilitates human readability and accessibility, but severely limits machine reusability and interoperability. We therefore discuss alternative and complementary approaches to the publication of research data.
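The abstract does not spell out how the recall gain is computed. As a sketch under one plausible assumption, the gain can be read as the number of articles found only through supplementary data files, divided by the number found through the manuscripts alone. The function name `recall_gain` and the article identifiers below are hypothetical toy data, not figures from the paper.

```python
def recall_gain(manuscript_hits: set[str], supplementary_hits: set[str]) -> float:
    """Relative increase in retrieved articles when supplementary files are searched.

    Assumed definition (not confirmed by the source): articles retrieved only
    via supplementary files, divided by articles retrieved via manuscripts.
    """
    if not manuscript_hits:
        raise ValueError("manuscript_hits must not be empty")
    # Articles that the manuscript-only search would have missed.
    new_articles = supplementary_hits - manuscript_hits
    return len(new_articles) / len(manuscript_hits)


# Toy example: four articles found via manuscripts, one extra via supplementary files.
manuscripts = {"PMC1", "PMC2", "PMC3", "PMC4"}
supplementary = {"PMC4", "PMC5"}
print(f"{recall_gain(manuscripts, supplementary):.0%}")
```

Under this reading, a 6% gain would mean roughly six additional articles for every hundred already retrieved from the manuscripts.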

AVAILABILITY AND IMPLEMENTATION

All extracted data are accessible and searchable as a collection on the BiodiversityPMC platform (https://biodiversitypmc.sibils.org/).
