Domain-specific embeddings uncover latent genetics knowledge.

Authors

Ho S S, Mills R E

Affiliations

Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA.

Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.

Publication information

bioRxiv. 2025 Mar 19:2025.03.17.643817. doi: 10.1101/2025.03.17.643817.

Abstract

The rate of scientific publishing is now so high that every researcher will inevitably miss new discoveries. To address this, we employ natural language processing to overcome human limits in reading, curation, and knowledge synthesis, with domain-specific applications to genetics and genomics. We construct a corpus of 3.5 million normalized genetics and genomics abstracts and implement both semantic and network-based embedding models. Our methods not only capture broad biological concepts and relationships but also predict complex phenomena such as gene expression. Through a rigorous temporal validation framework, we demonstrate that our embeddings predict gene-disease associations, cancer driver genes, and experimentally verified protein interactions years before their formal documentation in the literature. Additionally, our embeddings predict experimentally verified gene-gene interactions that are absent from the literature. These findings demonstrate that substantial undiscovered knowledge exists within the collective scientific literature and that computational approaches can accelerate biological discovery by identifying hidden connections across the fragmented landscape of scientific publishing.
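
The idea of temporal validation can be illustrated with a toy sketch. It is not the authors' pipeline: the abstract does not specify the embedding models, so plain co-occurrence count vectors stand in for them here, and every term name, year, and the cutoff are hypothetical placeholders. Vectors are built only from documents published before a cutoff year, gene-disease pairs that were never co-mentioned before the cutoff are ranked by cosine similarity, and the ranking is then compared with pairs that first appear together after the cutoff.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

# Toy corpus of (publication year, tokens). All names are hypothetical.
CORPUS = [
    (2010, ["geneA", "pathwayX"]),
    (2011, ["geneA", "pathwayX"]),
    (2012, ["diseaseD", "pathwayX"]),
    (2013, ["geneC", "pathwayY"]),
    (2014, ["diseaseD", "pathwayX", "geneB"]),
    (2018, ["geneA", "diseaseD"]),  # association documented only after the cutoff
]
CUTOFF = 2015  # assumed split year
GENES = ["geneA", "geneB", "geneC"]
DISEASE = "diseaseD"


def cooccurrence_vectors(docs):
    """Map each term to a sparse vector counting its co-occurring terms."""
    vectors = defaultdict(Counter)
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Build vectors from the pre-cutoff literature only.
past = [tokens for year, tokens in CORPUS if year < CUTOFF]
future = [tokens for year, tokens in CORPUS if year >= CUTOFF]
vectors = cooccurrence_vectors(past)

# Genes already co-mentioned with the disease are known, not predictions.
known = {g for g in GENES if vectors[g][DISEASE] > 0}
candidates = [g for g in GENES if g not in known]

# Rank the remaining genes by similarity to the disease vector.
ranked = sorted(
    candidates,
    key=lambda g: cosine(vectors[g], vectors[DISEASE]),
    reverse=True,
)

# Genes first co-mentioned with the disease after the cutoff.
confirmed_later = {
    g for g in candidates if any(g in t and DISEASE in t for t in future)
}

print("ranked candidates:", ranked)
print("documented after cutoff:", sorted(confirmed_later))
```

In this toy setup, geneA shares context with diseaseD before the cutoff without ever being co-mentioned with it, so it ranks above geneC, and it is also the pair that appears in the later literature. A real evaluation would replace the count vectors with learned embeddings and score the ranking over many held-out pairs.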

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f878/11957060/d6d3286ff9d7/nihpp-2025.03.17.643817v1-f0001.jpg
