Birger Moëll, Fredrik Sand Aronsson, Sanian Akbar
Division of Speech, Music and Hearing, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden.
Division of Speech and Language Pathology, Department of Clinical Science, Intervention and Technology, Karolinska Institutet, Stockholm, Sweden.
Front Artif Intell. 2025 Jun 18;8:1616145. doi: 10.3389/frai.2025.1616145. eCollection 2025.
The integration of large language models (LLMs) into healthcare holds immense promise, but also raises critical challenges, particularly regarding the interpretability and reliability of their reasoning processes. While models like DeepSeek R1, which incorporates explicit reasoning steps, show promise in enhancing performance and explainability, their alignment with domain-specific expert reasoning remains understudied.
This paper evaluates the medical reasoning capabilities of DeepSeek R1, comparing its outputs to the reasoning patterns of medical domain experts.
Through qualitative and quantitative analyses of 100 diverse clinical cases from the MedQA dataset, we demonstrate that DeepSeek R1 achieves 93% diagnostic accuracy and exhibits identifiable patterns of medical reasoning. Analysis of the seven error cases revealed six recurring error types: anchoring bias, difficulty integrating conflicting data, limited consideration of alternative diagnoses, overthinking, incomplete knowledge, and prioritizing definitive treatment over crucial intermediate steps.
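As a hedged illustration of the quantitative side of this analysis, the sketch below scores accuracy from a file of saved model outputs on MedQA-style items; the JSONL layout and the 'predicted' and 'answer_idx' field names are assumptions made for the example, not details taken from the study.

```python
import json

def score_medqa_outputs(path: str) -> float:
    """Compute diagnostic accuracy from saved model outputs.

    Assumes a JSONL file where each record carries the model's chosen
    option letter ('predicted') and the gold letter ('answer_idx');
    these field names are illustrative, not taken from the paper.
    """
    correct = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            correct += case["predicted"].strip().upper() == case["answer_idx"].strip().upper()
            total += 1
    return correct / total if total else 0.0

# e.g., 93 correct answers over 100 cases yields an accuracy of 0.93
```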
These findings highlight areas for improvement in LLM reasoning for medical applications. Notably, reasoning length mattered: longer responses carried a higher probability of error. The marked disparity in reasoning length suggests that extended explanations may signal uncertainty or reflect attempts to rationalize incorrect conclusions. Shorter responses (e.g., under 5,000 characters) were strongly associated with accuracy, providing a practical threshold for assessing confidence in model-generated answers. Beyond the observed reasoning errors, the LLM demonstrated sound clinical judgment by systematically evaluating patient information, forming a differential diagnosis, and selecting appropriate treatment based on established guidelines, drug efficacy, resistance patterns, and patient-specific factors. This ability to integrate complex information and apply clinical knowledge highlights the potential of LLMs to support medical decision-making through artificial medical reasoning.
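As a rough sketch of how the reported length finding could be operationalized, the snippet below flags long reasoning traces for review using the 5,000-character cutoff mentioned above; the function and field names are illustrative assumptions, and the threshold should be treated as tunable rather than definitive.

```python
# Heuristic based on the finding that short reasoning traces tend to be correct.
REASONING_CHAR_THRESHOLD = 5_000  # characters; tune on held-out data

def needs_review(reasoning_text: str,
                 threshold: int = REASONING_CHAR_THRESHOLD) -> bool:
    """Flag answers whose reasoning trace exceeds the length threshold."""
    return len(reasoning_text) > threshold

# Usage: route long-reasoning answers to a clinician for verification.
answers = [
    {"id": "case-001", "reasoning": "Concise chain of thought..."},
    {"id": "case-002", "reasoning": "x" * 12_000},  # unusually long trace
]
flagged = [a["id"] for a in answers if needs_review(a["reasoning"])]
print(flagged)  # ['case-002']
```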