Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Noncustom Models for Evidence-Based Medicine.

Author information

Woo Joshua J, Yang Andrew J, Olsen Reena J, Hasan Sayyida S, Nawabi Danyal H, Nwachukwu Benedict U, Williams Riley J, Ramkumar Prem N

Affiliations

Brown University/The Warren Alpert School of Brown University, Providence, Rhode Island, U.S.A.

Tufts University School of Medicine, Boston, Massachusetts, U.S.A.

Publication information

Arthroscopy. 2025 Mar;41(3):565-573.e6. doi: 10.1016/j.arthro.2024.10.042. Epub 2024 Nov 7.

DOI:10.1016/j.arthro.2024.10.042

PMID:39521391

Abstract

PURPOSE

To show the value of custom methods, namely Retrieval Augmented Generation (RAG)-based Large Language Models (LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information using an anterior cruciate ligament (ACL) injury case.

METHODS

A set of 100 questions and answers based on the 2022 AAOS ACL guidelines was curated. Closed-source models (OpenAI GPT4/GPT 3.5 and Anthropic's Claude3) and open-source models (Llama3 8b/70b and Mistral 8×7b) were asked the questions in base form and again with the AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with artificial intelligence (AI) agents and reevaluated. Two fellowship-trained surgeons blindly evaluated the accuracy of the responses of each cohort. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Metric for Evaluation of Translation with Explicit Ordering (METEOR) scores were calculated to assess semantic similarity in the responses.
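As a rough illustration of the kind of overlap scoring named in the methods, the following is a minimal sketch of a ROUGE-1 style unigram comparison between a reference answer and a model response. It is not the authors' evaluation pipeline: the function name `rouge1`, the whitespace tokenization, and the sample sentences are all assumptions made for illustration, and the sample text is not taken from the AAOS guidelines.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """ROUGE-1 style unigram overlap (hypothetical helper, not the study's code).

    Tokenization is a plain lowercase whitespace split, which is a
    simplifying assumption; published implementations typically also
    handle punctuation and optional stemming.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref_counts & cand_counts).values())

    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


# Invented example strings, used only to exercise the function.
scores = rouge1(
    "early reconstruction is recommended",
    "reconstruction is recommended",
)
print(scores)
```

Scores of this kind reflect word overlap with the reference answer rather than clinical correctness, which is presumably why the study also relied on blinded surgeon review for accuracy.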

RESULTS

All noncustom LLMs started below 60% accuracy. Applying RAG improved the accuracy of every model, by an average of 39.7%. The highest performing model with just RAG was Meta's open-source Llama3 70b (94%). The highest performing model with RAG and AI agents was OpenAI's GPT4 (95%).

CONCLUSIONS

RAG improved accuracy by an average of 39.7%, with the highest accuracy rate, 94%, achieved by Meta's Llama3 70b. Incorporating AI agents into a previously RAG-augmented LLM improved the accuracy rate of GPT4 to 95%. Thus, agentic and RAG-augmented LLMs can be accurate liaisons of information, supporting our hypothesis.

CLINICAL RELEVANCE

Despite the literature surrounding the use of LLMs in medicine, there has been considerable and appropriate skepticism given their variably accurate response rates. This study establishes the groundwork to identify whether custom modifications to LLMs using RAG and agentic augmentation can better deliver accurate information in orthopaedic care. With this knowledge, the medical information commonly sought from popular LLMs, such as ChatGPT, can be standardized and made more relevant, better supporting shared decision making between surgeon and patient.
