Clinical feasibility of AI Doctors: Evaluating the replacement potential of large language models in outpatient settings for central nervous system tumors.

Author information

Pan Yifeng, Tian Shen, Guo Jing, Cai Hongqing, Wan Jinghai, Fang Cheng

Affiliations

The School of Big Data and Artificial Intelligence, Anhui Xinhua University, Hefei, China.

Department of Neurosurgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China.

Publication information

Int J Med Inform. 2025 Jun 12;203:106013. doi: 10.1016/j.ijmedinf.2025.106013.

Abstract

BACKGROUND AND OBJECTIVES

The treatment of central nervous system (CNS) tumors is complex and resource-intensive, with higher mortality in underserved regions. Large language models (LLMs) show promise in medical support, but their real-world performance in CNS tumor outpatient care remains unclear. This study aims to assess the diagnostic and treatment capabilities of LLMs in bilingual clinical settings.

METHODS

This retrospective study evaluated three LLMs (ChatGPT-4o, DeepSeek-R1, and Doubao) as aids to neuro-oncology outpatient decision-making in bilingual (Chinese/English) clinical environments. A total of 338 outpatient cases were included, and each model was assigned three clinical tasks: differential diagnosis, main diagnosis, and treatment advice. Model outputs were compared against assessments by experienced neurosurgeons. Statistical analysis used McNemar tests, with P < 0.05 considered significant.
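As an illustration of the paired comparison described in the methods, the following is a minimal sketch of a two-sided exact McNemar test. The abstract does not say whether the exact or the chi-square form was used, so the exact binomial form here is an assumption. The function name `mcnemar_exact` and the discordant-pair counts are hypothetical and are not taken from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant-pair counts.

    b: cases the doctor got right and the model got wrong
    c: cases the model got right and the doctor got wrong
    Cases where both agree (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of any difference.
        return 1.0
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts for one model on one task (NOT data from the study).
p_value = mcnemar_exact(b=18, c=6)
verdict = "significant" if p_value < 0.05 else "not significant"
print(f"p = {p_value:.4f} ({verdict} at the 0.05 level)")
```

Only the discordant pairs matter because each case is judged by both the model and the doctor, so the test asks whether one of them errs alone more often than the other.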

RESULTS

ChatGPT-4o and DeepSeek-R1 achieved over 90% accuracy in differential diagnosis, with no significant difference from doctors (P > 0.05), while Doubao performed significantly worse (Chinese: P = 0.02; English: P = 0.01). In main diagnosis, neither ChatGPT-4o nor DeepSeek-R1 differed significantly from doctors' performance (P > 0.05), whereas Doubao underperformed (Chinese: P = 0.019; English: P = 0.011). For treatment recommendations, all models showed reduced accuracy (ChatGPT-4o: 80.5%; DeepSeek-R1: 79%; Doubao: 71.3%), significantly lower than doctors in both Chinese and English (P < 0.05). No performance difference was observed between Chinese and English cases.

CONCLUSION

LLMs show strong potential in the preliminary diagnosis and decision support for CNS tumors, and their cross-lingual adaptability underscores their clinical feasibility.
