Li Peng-Hsuan, Sun Yih-Yun, Juan Hsueh-Fen, Chen Chien-Yu, Tsai Huai-Kuang, Huang Jia-Hsin
Taiwan AI Labs, 6F., No. 70, Sec. 1, Chengde Road, Datong Dist., Taipei 10355, Taiwan.
Department of Life Science, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei 10617, Taiwan.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf070.
With the exponential growth of biomedical literature, leveraging large language models (LLMs) for automated understanding of medical knowledge has become increasingly critical for advancing precision medicine. However, current approaches that use LLMs to extract complex biological relationships from scientific literature face significant challenges in reliability, verifiability, and scalability. To overcome these obstacles to LLM-based biomedical literature understanding, we propose LORE, a novel unsupervised two-stage reading methodology that uses LLMs to model literature as a knowledge graph of verifiable factual statements and, in turn, as semantic embeddings in Euclidean space. When applied to PubMed abstracts for large-scale understanding of disease-gene relationships, LORE captured essential gene pathogenicity information. We demonstrated that modeling a latent pathogenic flow in the semantic embedding, with supervision from the ClinVar database, led to a 90% mean average precision in identifying relevant genes across 2097 diseases. This work provides a scalable and reproducible approach for leveraging LLMs in biomedical literature analysis, offering new opportunities for researchers to identify therapeutic targets efficiently.
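For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of how mean average precision over per-disease gene rankings is conventionally computed. The abstract does not describe the exact evaluation procedure, so this is the standard textbook definition rather than the authors' implementation. The function names, the disease and gene identifiers, and the toy data are all hypothetical and used for illustration only.

```python
from typing import Dict, List, Set


def average_precision(ranked_genes: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked gene list against a set of relevant genes.

    Relevant genes that never appear in the ranking count as misses,
    because the sum is normalized by the total number of relevant genes.
    """
    hits = 0
    precision_sum = 0.0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(
    rankings: Dict[str, List[str]], truth: Dict[str, Set[str]]
) -> float:
    """Mean of per-disease average precision, skipping diseases with no relevant genes."""
    scores = [
        average_precision(rankings.get(disease, []), genes)
        for disease, genes in truth.items()
        if genes
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical identifiers, not taken from the paper).
rankings = {
    "disease1": ["geneA", "geneB", "geneC"],
    "disease2": ["geneX", "geneY", "geneZ"],
}
truth = {
    "disease1": {"geneA", "geneC"},
    "disease2": {"geneY"},
}

map_score = mean_average_precision(rankings, truth)
print(f"mean average precision: {map_score:.3f}")
```

In this toy case the first disease scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3. In a setting like the one the abstract describes, the relevant gene sets would presumably come from a curated resource such as ClinVar, and each ranking would come from the model's scores for one disease.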