Unlocking the potential of PubMed Central supplementary data files.

Authors

Gobeill Julien, Caucheteur Déborah, Flament Alexandre, Michel Pierre-André, Mottaz Anaïs, Pasche Emilie, Ruch Patrick

Affiliations

SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva 1206, Switzerland.

BiTeM Group, Information Sciences, HES-SO/HEG Geneva, Carouge 1227, Switzerland.

Publication information

Bioinform Adv. 2025 Jun 27;5(1):vbaf155. doi: 10.1093/bioadv/vbaf155. eCollection 2025.

Abstract

MOTIVATION

Biocuration workflows often rely on comprehensive literature searches for specific biological entities. However, standard search engines such as MEDLINE and PubMed Central provide an incomplete picture of the scientific literature because they do not index the increasing amount of valuable information published in supplementary data files. Over two years, we addressed this gap by systematically extracting text from a large proportion (85%) of these files, resulting in 35 million searchable documents. To assess the information gain provided by supplementary data files beyond the manuscripts, we searched both the manuscripts and the supplementary data files for mentions of dozens of Global Core Biodata Resources (GCBRs), which are fundamental biological databases essential for the life sciences. Specifically, we searched for GCBR names and for accession numbers, which uniquely identify biological entities within these resources.
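As an illustration of what searching extracted text for accession numbers can look like, the following is a minimal sketch and not the authors' pipeline. It assumes a simple regular-expression approach and uses the accession format that UniProtKB documents for its own identifiers as one example resource; the function name `find_accessions` and the sample text are hypothetical.

```python
import re

# UniProtKB accession format as documented by UniProt. It serves here only as
# an example of one resource's identifier pattern (an assumption of this sketch).
UNIPROT_ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)


def find_accessions(text: str) -> list[str]:
    """Return unique accession-like tokens in order of first appearance."""
    matches = (m.group(0) for m in UNIPROT_ACCESSION.finditer(text))
    return list(dict.fromkeys(matches))


# Hypothetical text standing in for cells extracted from a spreadsheet.
sample = "Gene\tProtein\nTP53\tP04637\nBRCA1\tP38398\nTP53\tP04637"
print(find_accessions(sample))  # ['P04637', 'P38398']
```

A pattern-only match like this can produce false positives on tokens that merely look like accession numbers, so a real pipeline would likely need additional validation against the resource itself.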

RESULTS

The recall gain from using the supplementary data files to search for articles mentioning resource names is 6%. In addition, 97% of all accession numbers identified were published in the supplementary data files, highlighting their increasing importance for highly specific topics or curation pipelines. We show that the number of accession numbers published in the supplementary data files is increasing year on year, but that 87% of these are published in Excel files. This format facilitates human readability and accessibility, but severely limits machine reusability and interoperability. We therefore discuss alternative and complementary approaches to the publication of research data.
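The abstract does not spell out how recall gain is computed. The sketch below shows one plausible reading, labeled as an assumption: the number of articles retrieved only through supplementary data files, divided by the number retrieved through manuscripts. The function name `recall_gain` and the article identifiers are hypothetical and do not reproduce the reported figures.

```python
def recall_gain(manuscript_hits: set[str], supplementary_hits: set[str]) -> float:
    """Relative increase in retrieved articles when supplementary files are added.

    Assumed definition (not taken from the paper): articles found only via
    supplementary files, divided by articles found via manuscripts.
    """
    if not manuscript_hits:
        raise ValueError("manuscript_hits must not be empty")
    new_articles = supplementary_hits - manuscript_hits
    return len(new_articles) / len(manuscript_hits)


# Hypothetical article identifiers, for illustration only.
manuscripts = {"A1", "A2", "A3", "A4"}
supplementary = {"A3", "A4", "A5"}
print(recall_gain(manuscripts, supplementary))  # 0.25
```

Under this reading, a gain of 0.25 means that one additional article is found for every four already found in the manuscripts.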

AVAILABILITY AND IMPLEMENTATION

All extracted data are accessible and searchable as a collection on the BiodiversityPMC platform (https://biodiversitypmc.sibils.org/).
