Birger Moëll, Fredrik Sand Aronsson, Sanian Akbar
Division of Speech, Music and Hearing, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden.
Division of Speech and Language Pathology, Department of Clinical Science, Intervention and Technology, Karolinska Institutet, Stockholm, Sweden.
Front Artif Intell. 2025 Jun 18;8:1616145. doi: 10.3389/frai.2025.1616145. eCollection 2025.
The integration of large language models (LLMs) into healthcare holds immense promise, but also raises critical challenges, particularly regarding the interpretability and reliability of their reasoning processes. While models like DeepSeek R1, which incorporates explicit reasoning steps, show promise in enhancing performance and explainability, their alignment with domain-specific expert reasoning remains understudied.
This paper evaluates the medical reasoning capabilities of DeepSeek R1, comparing its outputs to the reasoning patterns of medical domain experts.
Through qualitative and quantitative analyses of 100 diverse clinical cases from the MedQA dataset, we demonstrate that DeepSeek R1 achieves 93% diagnostic accuracy and exhibits identifiable patterns of medical reasoning. Analysis of the seven error cases revealed six recurring error types: anchoring bias, difficulty integrating conflicting data, limited consideration of alternative diagnoses, overthinking, incomplete knowledge, and prioritizing definitive treatment over crucial intermediate steps.
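As a hedged illustration of the quantitative side of this analysis, the sketch below scores accuracy from a file of saved model outputs on MedQA-style items; the JSONL layout and the 'predicted' and 'answer_idx' field names are assumptions made for the example, not details taken from the study.

```python
import json

def score_medqa_outputs(path: str) -> float:
    """Compute diagnostic accuracy from saved model outputs.

    Assumes a JSONL file where each record carries the model's chosen
    option letter ('predicted') and the gold letter ('answer_idx');
    these field names are illustrative, not taken from the paper.
    """
    correct = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            correct += case["predicted"].strip().upper() == case["answer_idx"].strip().upper()
            total += 1
    return correct / total if total else 0.0

# e.g., 93 correct answers over 100 cases yields an accuracy of 0.93
```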
These findings highlight areas for improvement in LLM reasoning for medical applications. Notably, reasoning length mattered: longer responses carried a higher probability of error. The marked disparity in reasoning length suggests that extended explanations may signal uncertainty or reflect attempts to rationalize incorrect conclusions. Shorter responses (e.g., under 5,000 characters) were strongly associated with accuracy, providing a practical threshold for assessing confidence in model-generated answers. Beyond the observed reasoning errors, the LLM demonstrated sound clinical judgment by systematically evaluating patient information, forming a differential diagnosis, and selecting appropriate treatment based on established guidelines, drug efficacy, resistance patterns, and patient-specific factors. This ability to integrate complex information and apply clinical knowledge highlights the potential of LLMs to support medical decision-making through artificial medical reasoning.
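As a rough sketch of how the reported length finding could be operationalized, the snippet below flags long reasoning traces for review using the 5,000-character cutoff mentioned above; the function and field names are illustrative assumptions, and the threshold should be treated as tunable rather than definitive.

```python
# Heuristic based on the finding that short reasoning traces tend to be correct.
REASONING_CHAR_THRESHOLD = 5_000  # characters; tune on held-out data

def needs_review(reasoning_text: str,
                 threshold: int = REASONING_CHAR_THRESHOLD) -> bool:
    """Flag answers whose reasoning trace exceeds the length threshold."""
    return len(reasoning_text) > threshold

# Usage: route long-reasoning answers to a clinician for verification.
answers = [
    {"id": "case-001", "reasoning": "Concise chain of thought..."},
    {"id": "case-002", "reasoning": "x" * 12_000},  # unusually long trace
]
flagged = [a["id"] for a in answers if needs_review(a["reasoning"])]
print(flagged)  # ['case-002']
```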