Tordjman Mickael, Liu Zelong, Yuce Murat, Fauveau Valentin, Mei Yunhao, Hadjadj Jerome, Bolger Ian, Almansour Haidara, Horst Carolyn, Parihar Ashwin Singh, Geahchan Amine, Meribout Anis, Yatim Nader, Ng Nicole, Robson Phillip, Zhou Alexander, Lewis Sara, Huang Mingqian, Deyer Timothy, Taouli Bachir, Lee Hao-Chih, Fayad Zahi A, Mei Xueyan
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Department of Diagnostic, Molecular and Interventional Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Nat Med. 2025 Apr 23. doi: 10.1038/s41591-025-03726-3.
DeepSeek is a newly introduced large language model (LLM) designed for enhanced reasoning, but its medical-domain capabilities have not yet been evaluated. Here we assessed the capabilities of three LLMs (DeepSeek-R1, ChatGPT-o1 and Llama 3.1-405B) in performing four different medical tasks: answering questions from the United States Medical Licensing Examination (USMLE), interpreting and reasoning on the basis of text-based diagnostic and management cases, classifying tumors according to RECIST 1.1 criteria and summarizing diagnostic imaging reports across multiple modalities. In the USMLE test, the performance of DeepSeek-R1 (accuracy 0.92) was slightly inferior to that of ChatGPT-o1 (accuracy 0.95; P = 0.04) but better than that of Llama 3.1-405B (accuracy 0.83; P < 10⁻³). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy 0.57 versus 0.55, P = 0.76, on New England Journal of Medicine cases; 0.74 versus 0.76, P = 0.06, on Médicilline cases). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.74 versus 0.81; P = 0.10). Diagnostic reasoning steps provided by DeepSeek-R1 were deemed more accurate than those provided by ChatGPT-o1 and Llama 3.1-405B (average Likert scores of 3.61, 3.22 and 3.13, respectively; P = 0.005 and P < 10⁻³). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 versus 4.8; P < 10⁻³). This study highlights the potential of the DeepSeek-R1 LLM for medical applications but also underlines areas needing improvement.
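The abstract reports pairwise P values for model accuracy on shared question sets but does not name the statistical test used. A minimal sketch of one standard approach for paired binary outcomes, McNemar's test on per-question correctness; the test choice, question count and simulated grades below are illustrative assumptions, not the study's actual procedure or data:

```python
# Hypothetical sketch: comparing two models' accuracy on the same paired
# benchmark questions (e.g., USMLE items answered by both models).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 325  # illustrative size, not the study's actual question count

# 1 = correct, 0 = incorrect; simulated stand-ins for real grading results,
# drawn near the accuracies reported in the abstract.
model_a = rng.binomial(1, 0.92, n_questions)  # DeepSeek-R1-like accuracy
model_b = rng.binomial(1, 0.95, n_questions)  # ChatGPT-o1-like accuracy

# 2x2 contingency table of agreement/disagreement between the two models;
# McNemar's test uses only the discordant cells (one right, one wrong).
table = np.zeros((2, 2), dtype=int)
for a, b in zip(model_a, model_b):
    table[a, b] += 1

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"accuracy A = {model_a.mean():.2f}, accuracy B = {model_b.mean():.2f}")
print(f"McNemar P = {result.pvalue:.3f}")
```

A paired test is preferable to comparing two independent proportions here, since both models answer the identical questions and their errors are correlated.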