
Biomedical knowledge graph-optimized prompt generation for large language models.

Affiliations

Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, United States.

San Diego Supercomputer Center, University of California, San Diego, CA 92093, United States.

Publication Information

Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae560.

Abstract

MOTIVATION

Large language models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains such as biomedicine. Solutions such as pretraining and domain-specific fine-tuning add substantial computational overhead and require further domain expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework that leverages a massive biomedical KG (SPOKE) together with LLMs such as Llama-2-13b, GPT-3.5-Turbo, and GPT-4 to generate meaningful biomedical text rooted in established knowledge.

RESULTS

Compared to existing RAG techniques for knowledge graphs, the proposed method utilizes a minimal graph schema for context extraction and uses embedding methods for context pruning. This optimization in context extraction yields a more than 50% reduction in token consumption without compromising accuracy, making for a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (where available) to substantiate the claims. Further benchmarking on human-curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters on domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models such as GPT-3.5 and GPT-4. In summary, the proposed framework combines the explicit knowledge of the KG and the implicit knowledge of the LLM in a token-optimized fashion, enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions cost-effectively.
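The embedding-based context pruning described above can be sketched roughly as follows: KG context sentences retrieved for a question are ranked by embedding similarity to the question, and only the most relevant ones are kept, shrinking the prompt and hence token consumption. This is a minimal illustrative sketch, not the authors' implementation: the `tokenize`/`embed`/`prune_context` names are hypothetical, and a toy bag-of-words vector stands in for the real sentence-embedding model KG-RAG would use.

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and strip trailing punctuation from each word."""
    return [w.strip(".,?!").lower() for w in text.split()]

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy bag-of-words embedding over a shared vocabulary
    (a stand-in for a real sentence-embedding model)."""
    counts = Counter(tokenize(text))
    return [float(counts[w]) for w in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def prune_context(question: str, contexts: list[str], top_k: int = 2) -> list[str]:
    """Keep only the top_k KG context sentences most similar to the question,
    reducing the tokens sent to the LLM."""
    vocab = sorted({w for t in [question, *contexts] for w in tokenize(t)})
    q_vec = embed(question, vocab)
    ranked = sorted(contexts, key=lambda c: cosine(q_vec, embed(c, vocab)),
                    reverse=True)
    return ranked[:top_k]

question = "Which gene is associated with cystic fibrosis?"
contexts = [
    "CFTR gene is associated with cystic fibrosis disease.",
    "Aspirin treats headache.",
    "Cystic fibrosis is a disease of the lungs.",
]
pruned = prune_context(question, contexts, top_k=2)
print(pruned)
```

Only the two disease-relevant sentences survive pruning; the irrelevant aspirin fact is dropped before the prompt is assembled, which is how the framework trims token usage without losing the context needed to ground the answer.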

AVAILABILITY AND IMPLEMENTATION

SPOKE KG can be accessed at https://spoke.rbvi.ucsf.edu/neighborhood.html. It can also be accessed using REST-API (https://spoke.rbvi.ucsf.edu/swagger/). KG-RAG code is made available at https://github.com/BaranziniLab/KG_RAG. Biomedical benchmark datasets used in this study are made available to the research community in the same GitHub repository.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9357/11441322/cc10c34a018b/btae560f1.jpg
