Unlocking the potential of PubMed Central supplementary data files.

Authors

Gobeill Julien, Caucheteur Déborah, Flament Alexandre, Michel Pierre-André, Mottaz Anaïs, Pasche Emilie, Ruch Patrick

Affiliations

SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva 1206, Switzerland.

BiTeM Group, Information Sciences, HES-SO/HEG Geneva, Carouge 1227, Switzerland.

Publication information

Bioinform Adv. 2025 Jun 27;5(1):vbaf155. doi: 10.1093/bioadv/vbaf155. eCollection 2025.

DOI: 10.1093/bioadv/vbaf155
PMID: 40861394
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12371329/

Abstract

MOTIVATION

Biocuration workflows often rely on comprehensive literature searches for specific biological entities. However, standard search engines such as MEDLINE and PubMed Central provide an incomplete picture of the scientific literature because they do not index the increasing amount of valuable information published in supplementary data files. Over two years, we addressed this gap by systematically extracting text from a large proportion (85%) of these files, resulting in 35 million searchable documents. To assess the information gain provided by supplementary data files beyond the manuscripts, we searched both the manuscripts and the supplementary data files for mentions of dozens of Global Core Biodata Resources (GCBRs), which are fundamental biological databases essential for the life sciences. Specifically, we searched for GCBR names and for accession numbers, which uniquely identify biological entities within these resources.

RESULTS

The recall gain from using the supplementary data files to search for articles mentioning resource names is 6%. In addition, 97% of all accession numbers identified were published in the supplementary data files, highlighting their increasing importance for highly specific topics or curation pipelines. We show that the number of accession numbers published in the supplementary data files is increasing year on year, but that 87% of these are published in Excel files. This format facilitates human readability and accessibility, but severely limits machine reusability and interoperability. We therefore discuss alternative and complementary approaches to the publication of research data.
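
To make the recall-gain figure concrete, the following is a minimal sketch of how such a gain could be computed from two sets of retrieved article identifiers. The definition used here (extra articles found only through supplementary files, relative to the articles found through manuscripts alone) is an assumption for illustration, not necessarily the exact formula used by the authors; the function name `recall_gain` and the article identifiers are hypothetical toy values.

```python
def recall_gain(manuscript_hits: set[str], supplementary_hits: set[str]) -> float:
    """Relative increase in retrieved articles when supplementary files are also searched.

    Assumed definition (illustrative): (|M ∪ S| - |M|) / |M|.
    """
    if not manuscript_hits:
        raise ValueError("manuscript_hits must not be empty")
    combined = manuscript_hits | supplementary_hits
    return (len(combined) - len(manuscript_hits)) / len(manuscript_hits)


# Hypothetical toy data: articles whose manuscript mentions a resource name,
# and articles whose supplementary files mention it.
manuscript_hits = {"PMC1", "PMC2", "PMC3", "PMC4", "PMC5"}
supplementary_hits = {"PMC4", "PMC5", "PMC6"}

gain = recall_gain(manuscript_hits, supplementary_hits)
print(f"Recall gain: {gain:.0%}")  # prints "Recall gain: 20%"
```

In this toy example only one article (PMC6) is reachable solely through its supplementary files, so the gain is 1 out of 5 manuscript hits.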

AVAILABILITY AND IMPLEMENTATION

All extracted data are accessible and searchable as a collection on the BiodiversityPMC platform (https://biodiversitypmc.sibils.org/).

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2784/12371329/d73022a723c3/vbaf155f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2784/12371329/122f718bd16d/vbaf155f2.jpg
