Wang Ling, Li Jinglin, Zhuang Boyang, Huang Shasha, Fang Meilin, Wang Cunze, Li Wen, Zhang Mohan, Gong Shurong
Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, Fuzhou, China.
School of Pharmacy, Fujian Medical University, Fuzhou, China.
J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.
Large language models (LLMs) have flourished and gradually become an important research and application direction in the medical field. However, medicine is highly specialized, complex, and domain specific, which imposes extremely high accuracy requirements, so controversy remains about whether LLMs can be used in the medical field. A growing number of studies have evaluated the performance of various LLMs in medicine, but their conclusions are inconsistent.
This study uses a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical research questions and to provide high-level evidence for their future development and application in the medical field.
In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024. Published reports were screened, and studies assessing the accuracy of LLMs when answering clinical research questions were included. The systematic review and NMA compared the accuracy of different LLMs across question types: objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The NMA was performed within a Bayesian framework, and indirect comparisons between models were made using a grading scale. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher accuracy ranking for the corresponding LLM.
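For readers unfamiliar with the SUCRA metric cited throughout the results, the following minimal Python sketch shows how SUCRA values are conventionally derived from the posterior rank probabilities that a Bayesian NMA produces; the probability matrix below is hypothetical and is not taken from this study's data.

```python
import numpy as np

def sucra(rank_probs: np.ndarray) -> np.ndarray:
    """Compute SUCRA values from a rank-probability matrix.

    rank_probs[i, k] is the probability that model k occupies rank i+1
    (rank 1 = most accurate); each column sums to 1. For a competing
    models, SUCRA_k = (1 / (a - 1)) * sum_{j=1}^{a-1} cum_{j,k}, where
    cum_{j,k} is the cumulative probability that model k ranks j or better.
    """
    a = rank_probs.shape[0]              # number of competing models
    cum = np.cumsum(rank_probs, axis=0)  # cum[j, k] = P(rank of k <= j+1)
    return cum[:-1].sum(axis=0) / (a - 1)

# Hypothetical rank probabilities for three models (columns); rows = ranks 1-3.
probs = np.array([
    [0.70, 0.25, 0.05],
    [0.20, 0.55, 0.25],
    [0.10, 0.20, 0.70],
])
print(sucra(probs))  # [0.8, 0.525, 0.175]; values near 1 mean a model is consistently ranked best
```

A SUCRA of 1 means a model is certain to rank first, and 0 means it is certain to rank last, which is why the reported values (eg, 0.9207 for ChatGPT-4o on objective questions) can be read directly as relative standings.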
The systematic review and NMA examined 168 articles encompassing 35,896 questions and 3063 clinical cases. Of the 168 studies, 40 (23.8%) were rated as having a low risk of bias, 128 (76.2%) a moderate risk, and none a high risk. ChatGPT-4o (SUCRA=0.9207) ranked highest for accuracy on objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at answering open-ended questions. For the top 1 and top 3 diagnoses of clinical cases, human experts ranked highest (SUCRA=0.9001 and SUCRA=0.7126, respectively), while Claude 3 Opus (SUCRA=0.9672) performed best at the top 5 diagnosis. Gemini (SUCRA=0.9649) had the highest SUCRA value for accuracy in triage and classification.
Our study indicates that ChatGPT-4o has an advantage when answering objective questions, while ChatGPT-4 may be more credible for open-ended questions. Human experts remain more accurate for the top 1 and top 3 diagnoses. Claude 3 Opus performs better at the top 5 diagnosis, and Gemini is more advantageous for triage and classification. This analysis offers valuable insights for clinicians and medical practitioners, helping them leverage LLMs effectively for decision-making in learning, diagnosis, and management across clinical scenarios.
PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245.