Development of a liver disease-specific large language model chat interface using retrieval-augmented generation.

Authors

Ge Jin, Sun Steve, Owens Joseph, Galvez Victor, Gologorskaya Oksana, Lai Jennifer C, Pletcher Mark J, Lai Ki

Affiliations

Department of Medicine, Division of Gastroenterology and Hepatology, University of California-San Francisco, San Francisco, California, USA.

UCSF Health Information Technology, University of California-San Francisco, San Francisco, California, USA.

Publication information

Hepatology. 2024 Nov 1;80(5):1158-1168. doi: 10.1097/HEP.0000000000000834. Epub 2024 Mar 7.

Abstract

BACKGROUND AND AIMS

Large language models (LLMs) have significant capabilities in clinical information processing tasks. Commercially available LLMs, however, are not optimized for clinical uses and are prone to generating hallucinatory information. Retrieval-augmented generation (RAG) is an enterprise architecture that allows the embedding of customized data into LLMs. This approach "specializes" the LLMs and is thought to reduce hallucinations.

APPROACH AND RESULTS

We developed "LiVersa," a liver disease-specific LLM, by using our institution's protected health information-compliant text embedding and LLM platform, "Versa." We applied RAG to 30 publicly available American Association for the Study of Liver Diseases guidance documents to incorporate them into LiVersa. We evaluated LiVersa's performance in 2 rounds of testing. First, we compared LiVersa's outputs with those of trainees from a previously published knowledge assessment. LiVersa answered all 10 questions correctly. Second, we asked 15 hepatologists to evaluate the outputs of 10 hepatology topic questions generated by LiVersa, OpenAI's ChatGPT 4, and Meta's Large Language Model Meta AI 2 (LLaMA 2). LiVersa's outputs were more accurate but were rated less comprehensive and less safe than those of ChatGPT 4.

CONCLUSIONS

In this demonstration, we built disease-specific and protected health information-compliant LLMs using RAG. While LiVersa demonstrated higher accuracy in answering questions related to hepatology, there were some deficiencies due to limitations set by the number of documents used for RAG. LiVersa will likely require further refinement before potential live deployment. The LiVersa prototype, however, is a proof of concept for utilizing RAG to customize LLMs for clinical use cases.
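
The retrieve-then-prompt pattern that RAG refers to can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' Versa or LiVersa pipeline: the bag-of-words `embed` function stands in for a real text-embedding model, the `CHUNKS` passages are invented placeholders rather than excerpts from any guidance document, and all function names are hypothetical. No LLM is called; the sketch stops at the prompt that would be sent to one.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words counts (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the context below. If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical placeholder passages; a real system would index chunks of the source documents.
CHUNKS = [
    "placeholder passage about screening intervals for liver cancer",
    "placeholder passage about vaccination schedules",
    "placeholder passage about nutrition in cirrhosis",
]

if __name__ == "__main__":
    print(build_prompt("How often is screening for liver cancer done?", CHUNKS, k=1))
```

Because the model is instructed to answer only from retrieved passages, answer quality in such a design depends on what the indexed corpus covers, which is consistent with the limitation the conclusions attribute to the number of documents used for RAG.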
