Johri Shreya, Jeong Jaehwan, Tran Benjamin A, Schlessinger Daniel I, Wongvibulsin Shannon, Barnes Leandra A, Zhou Hong-Yu, Cai Zhuo Ran, Van Allen Eliezer M, Kim David, Daneshjou Roxana, Rajpurkar Pranav
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Department of Computer Science, Stanford University, Stanford, CA, USA.
Nat Med. 2025 Jan;31(1):77-86. doi: 10.1038/s41591-024-03328-5. Epub 2025 Jan 2.
The integration of large language models (LLMs) into clinical diagnostics has the potential to transform doctor-patient interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD), an approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4, GPT-3.5, Mistral and LLaMA-2-7b across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations persisted when analyzing the multimodal conversational and visual assessment capabilities of GPT-4V. Based on our empirical findings, we propose a comprehensive set of recommendations for future evaluations of clinical LLMs, emphasizing realistic doctor-patient conversations, comprehensive history-taking, open-ended questioning and a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in the testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.