Ho S S, Mills R E
Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA.
Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
bioRxiv. 2025 Mar 19:2025.03.17.643817. doi: 10.1101/2025.03.17.643817.
The scientific literature now grows faster than any researcher can read, so relevant discoveries are routinely missed. We apply natural language processing to overcome human limits in reading, curation, and knowledge synthesis, with domain-specific applications to genetics and genomics. We construct a corpus of 3.5 million normalized genetics and genomics abstracts and implement both semantic and network-based embedding models. These embeddings capture broad biological concepts and relationships and also predict complex phenomena such as gene expression. Through a rigorous temporal validation framework, we demonstrate that the embeddings predict gene-disease associations, cancer driver genes, and experimentally verified protein interactions years before their formal documentation in the literature. They also predict experimentally verified gene-gene interactions that are absent from the literature. These findings indicate that substantial undiscovered knowledge exists within the collective scientific literature and that computational approaches can accelerate biological discovery by identifying hidden connections across the fragmented landscape of scientific publishing.
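The temporal validation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how candidates are scored or evaluated, so everything below is an assumption made for illustration: the cosine-similarity ranking, the hits-at-k measure, the function names, and the toy two-dimensional vectors, which stand in for embeddings trained only on abstracts published before a cutoff year. The gene and disease labels are hypothetical placeholders, not data from the study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(embeddings, disease, known_genes):
    """Rank genes not yet linked to the disease by similarity to its vector."""
    target = embeddings[disease]
    candidates = [g for g in embeddings if g != disease and g not in known_genes]
    return sorted(candidates, key=lambda g: cosine(embeddings[g], target), reverse=True)


def hits_at_k(ranking, future_genes, k):
    """Count top-k predictions that were documented only after the cutoff."""
    return sum(1 for g in ranking[:k] if g in future_genes)


# Hypothetical toy vectors; assumed to come from a model trained on pre-cutoff text.
embeddings = {
    "disease_X": [1.0, 0.0],
    "GENE_A": [0.9, 0.1],
    "GENE_B": [0.8, 0.6],
    "GENE_C": [0.0, 1.0],
    "GENE_D": [1.0, 0.2],
}
known_before_cutoff = {"GENE_A"}      # associations already in the pre-cutoff literature
documented_after_cutoff = {"GENE_D"}  # held-out associations published later

ranking = rank_candidates(embeddings, "disease_X", known_before_cutoff)
hits = hits_at_k(ranking, documented_after_cutoff, k=1)
```

The point of the split is that the held-out associations are never visible when the embeddings are built, so a high rank for a later-documented gene reflects signal already present in the earlier literature rather than leakage from the later one.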