Wu Xidong, Zeng Yiming, Das Arun, Jo Sumin, Zhang Tinghe, Patel Parth, Zhang Jianqiu, Gao Shou-Jiang, Pratt Dexter, Chiu Yu-Chiao, Huang Yufei
Electrical and Computer Engineering, University of Pittsburgh.
Hillman Cancer Center, University of Pittsburgh Medical Center.
bioRxiv. 2024 Jan 30:2024.01.27.577521. doi: 10.1101/2024.01.27.577521.
Molecular Regulatory Pathways (MRPs) are crucial for understanding biological functions. Knowledge Graphs (KGs) have become vital in organizing and analyzing MRPs, providing structured representations of complex biological interactions. Current tools for mining KGs from biomedical literature are inadequate in capturing complex, hierarchical relationships and contextual information about MRPs. Large Language Models (LLMs) like GPT-4 offer a promising solution, with advanced capabilities to decipher the intricate nuances of language. However, their potential for end-to-end KG construction, particularly for MRPs, remains largely unexplored.
We present reguloGPT, a novel GPT-4-based in-context learning prompt designed for end-to-end joint named entity recognition, N-ary relationship extraction, and context prediction from a sentence that describes regulatory interactions in MRPs. The reguloGPT approach introduces a context-aware relational graph that captures the hierarchical structure of MRPs and resolves semantic inconsistencies by embedding context directly within relational edges. We created a benchmark dataset of 400 annotated PubMed titles on N6-methyladenosine (m6A) regulation. Rigorous evaluation of reguloGPT on the benchmark dataset demonstrated marked improvement over existing algorithms. We further developed a novel G-Eval scheme that leverages GPT-4 for annotation-free performance evaluation, and demonstrated its agreement with traditional annotation-based evaluation. Using reguloGPT predictions on m6A-related titles, we constructed the m6A-KG and demonstrated its utility in elucidating m6A's regulatory mechanisms in cancer phenotypes across various cancers. These results underscore reguloGPT's transformative potential for extracting biological knowledge from the literature.
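The idea of embedding context directly within relational edges can be illustrated with a small data-structure sketch. This is not the reguloGPT implementation: the names `ContextEdge`, `ContextAwareGraph`, `add`, and `edges_from` are hypothetical, the entities are placeholders rather than real findings, and the nesting of one edge as the target of another is only one assumed way to represent a hierarchical (N-ary) relation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextEdge:
    """One regulatory relation whose context lives on the edge itself."""
    source: object          # entity name, or another ContextEdge
    relation: str
    target: object          # entity name, or another ContextEdge (nesting)
    context: tuple = ()     # sorted (key, value) pairs


class ContextAwareGraph:
    """Toy graph: the same triple under two contexts yields two distinct edges."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, source, relation, target, **context):
        edge = ContextEdge(source, relation, target,
                           tuple(sorted(context.items())))
        if edge not in self._out[source]:   # skip exact duplicates
            self._out[source].append(edge)
        return edge

    def edges_from(self, source, **context):
        """Edges leaving `source` whose context includes all given pairs."""
        wanted = set(context.items())
        return [e for e in self._out.get(source, [])
                if wanted <= set(e.context)]


# Placeholder entities, purely illustrative.
graph = ContextAwareGraph()
inner = graph.add("WriterX", "methylates", "TranscriptY", disease="cancer type Z")
graph.add("WriterX", "methylates", "TranscriptY", disease="cancer type W")
# Hierarchical relation: GeneA promotes the inner relation as a whole.
graph.add("GeneA", "promotes", inner, disease="cancer type Z")

for edge in graph.edges_from("WriterX", disease="cancer type Z"):
    print(edge.relation, edge.target, dict(edge.context))
```

Because the context is part of the edge's identity, two statements that would look contradictory as bare triples can coexist as long as their contexts differ, which is the kind of semantic inconsistency the abstract refers to.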
The source code of reguloGPT, the m6A title and benchmark datasets, and the m6A-KG are available at: https://github.com/Huang-AI4Medicine-Lab/reguloGPT.