Ge Jin, Sun Steve, Owens Joseph, Galvez Victor, Gologorskaya Oksana, Lai Jennifer C, Pletcher Mark J, Lai Ki
Department of Medicine, Division of Gastroenterology and Hepatology, University of California-San Francisco, San Francisco, California, USA.
UCSF Health Information Technology, University of California-San Francisco, San Francisco, California, USA.
Hepatology. 2024 Nov 1;80(5):1158-1168. doi: 10.1097/HEP.0000000000000834. Epub 2024 Mar 7.
Large language models (LLMs) have significant capabilities in clinical information processing tasks. Commercially available LLMs, however, are not optimized for clinical uses and are prone to generating hallucinatory information. Retrieval-augmented generation (RAG) is an enterprise architecture that allows the embedding of customized data into LLMs. This approach "specializes" the LLMs and is thought to reduce hallucinations.
We developed "LiVersa," a liver disease-specific LLM, using our institution's protected health information-compliant text embedding and LLM platform, "Versa." We conducted RAG on 30 publicly available American Association for the Study of Liver Diseases guidance documents to be incorporated into LiVersa. We evaluated LiVersa's performance by conducting 2 rounds of testing. First, we compared LiVersa's outputs with those of trainees from a previously published knowledge assessment. LiVersa answered all 10 questions correctly. Second, we asked 15 hepatologists to evaluate the outputs of 10 hepatology topic questions generated by LiVersa, OpenAI's ChatGPT 4, and Meta's Large Language Model Meta AI 2. LiVersa's outputs were more accurate but were rated less comprehensive and safe compared to those of ChatGPT 4.
In this demonstration, we built disease-specific and protected health information-compliant LLMs using RAG. While LiVersa demonstrated higher accuracy in answering questions related to hepatology, there were some deficiencies due to limitations set by the number of documents used for RAG. LiVersa will likely require further refinement before potential live deployment. The LiVersa prototype, however, is a proof of concept for utilizing RAG to customize LLMs for clinical use cases.
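The RAG architecture described above (embedding documents, retrieving the most relevant passages for a query, and augmenting the LLM prompt with them) can be sketched minimally as follows. This is an illustrative toy, not the authors' implementation: a bag-of-words counter stands in for a real text-embedding model, the three corpus snippets are hypothetical placeholders for the 30 AASLD guidance documents, and the final LLM call is omitted.

```python
# Minimal RAG sketch. Assumptions: a toy bag-of-words "embedding"
# replaces a real embedding model, and the corpus is a stand-in for
# the actual guidance documents used in the study.
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical snippets playing the role of guidance documents.
corpus = [
    "Hepatitis B surveillance recommendations for patients with cirrhosis.",
    "Management of ascites in decompensated cirrhosis.",
    "Liver transplant evaluation criteria and MELD scoring.",
]
index = [(doc, embed(doc)) for doc in corpus]

def retrieve(query, k=1):
    """Return the k corpus snippets most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question):
    """Augment the question with retrieved context before the LLM call."""
    context = "\n".join(retrieve(question))
    return (f"Answer using only this context:\n{context}\n\n"
            f"Question: {question}")

prompt = build_prompt("How should ascites be managed in cirrhosis?")
```

Grounding the prompt in retrieved passages is what "specializes" the model and constrains it to the supplied guidance, which is the mechanism thought to reduce hallucinations.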