Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, 10 Center Dr, Bethesda, MD 20892, USA.
Am J Hum Genet. 2024 Sep 5;111(9):1819-1833. doi: 10.1016/j.ajhg.2024.07.011. Epub 2024 Aug 14.
Large language models (LLMs) are generating interest in medical settings. For example, LLMs can respond coherently to medical queries by providing plausible differential diagnoses based on clinical notes. However, many questions remain to be explored, such as differences between open- and closed-source LLMs and LLM performance on queries from both medical and non-medical users. In this study, we assessed multiple LLMs, including Llama-2-chat, Vicuna, Medllama2, Bard/Gemini, Claude, ChatGPT3.5, and ChatGPT-4, as well as non-LLM approaches (Google search and Phenomizer), on their ability to identify genetic conditions from textbook-like clinician questions and their corresponding layperson translations related to 63 genetic conditions. Among open-source LLMs, larger models were more accurate than smaller ones: models with 7b, 13b, and more than 33b parameters achieved accuracies of 21%-49%, 41%-51%, and 54%-68%, respectively. Closed-source LLMs outperformed open-source LLMs, with ChatGPT-4 performing best (89%-90%). Three of the 11 LLMs, as well as Google search, had significant performance gaps between clinician and layperson prompts. We also evaluated how in-context prompting and keyword removal affected open-source LLM performance. Models were provided with two types of in-context prompts: list-type prompts, which improved LLM performance, and definition-type prompts, which did not. We further analyzed the effect of removing rare terms from the descriptions, which decreased accuracy for 5 of the 7 evaluated LLMs. Finally, we observed much lower performance with real individuals' descriptions; LLMs answered these questions with at most 21% accuracy.
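For readers interested in how an accuracy comparison of this kind can be organized, the sketch below illustrates one possible evaluation loop over clinician and layperson prompts. It is a minimal illustration only: the Case structure, the query_model placeholder, and the substring-based scoring rule are assumptions introduced here, not the authors' actual code, data, or grading criteria.

```python
"""Minimal sketch of an LLM accuracy-evaluation loop for genetic-condition
identification. All names here (Case, query_model, CASES) are illustrative
placeholders, not the study's implementation."""

from dataclasses import dataclass


@dataclass
class Case:
    condition: str          # expected genetic condition, e.g. "Marfan syndrome"
    clinician_prompt: str   # textbook-like clinical description
    layperson_prompt: str   # plain-language translation of the same description


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to an open- or closed-source LLM.

    In practice this would wrap the relevant API client or local
    inference code for the model being evaluated.
    """
    raise NotImplementedError


def is_correct(response: str, condition: str) -> bool:
    # Simplified scoring: count the answer as correct if the expected
    # condition name appears in the model's response. The study's actual
    # scoring criteria may differ (e.g., synonyms, ranked answers).
    return condition.lower() in response.lower()


def accuracy(model_name: str, cases: list[Case], prompt_type: str) -> float:
    """Fraction of cases answered correctly for one prompt type."""
    hits = 0
    for case in cases:
        prompt = (case.clinician_prompt if prompt_type == "clinician"
                  else case.layperson_prompt)
        question = f"{prompt}\nWhich genetic condition best fits this description?"
        if is_correct(query_model(model_name, question), case.condition):
            hits += 1
    return hits / len(cases)


# Hypothetical usage, comparing prompt types for one model:
# for ptype in ("clinician", "layperson"):
#     print(ptype, accuracy("some-model", CASES, ptype))
```

Variations of the same loop could cover the other manipulations described above, for example prepending a list-type or definition-type in-context prompt to the question, or stripping rare terms from the description before querying the model.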