Miao Beibei, Sun Qian, Wang Peien, Shao Rongjun, Ding Yingying, Chen Yuanlong, Ying Rongbiao
Department of Thyroid and Breast Surgery, Taizhou Campus of Zhejiang Cancer Hospital (Taizhou Cancer Hospital), No. 50, Zhenxin Road, Xinhe Town, Wenling, 317502, Taizhou, China.
Department of Interventional and Minimally Invasive Surgery, Taizhou Campus of Zhejiang Cancer Hospital (Taizhou Cancer Hospital), No. 50, Zhenxin Road, Xinhe Town, Wenling, 317502, Taizhou, China.
Sci Rep. 2025 Aug 26;15(1):31450. doi: 10.1038/s41598-025-16999-y.
This study investigated and compared the diagnostic performance, disease interpretation reliability, and treatment recommendation capabilities of three large language models (GPT-4o, DeepSeek-R1, and DeepSeek-V3) in breast tumor cases. Comprehensive clinical records of patients with breast tumors treated at Taizhou Cancer Hospital between January and April 2024 were collected retrospectively. For each model, the study evaluated the accuracy of tumor classification (benign vs. malignant), the quality of disease interpretation, and the appropriateness of the treatment recommendations it generated. To assess the clinical interpretability and utility of the models, their performance was analyzed statistically. A total of 45 patients with breast tumors were included, comprising 37 benign and 8 malignant cases (43 females, 2 males). GPT-4o achieved the highest area under the curve (AUC) for tumor classification (AUC = 0.848), compared with DeepSeek-R1 (AUC = 0.736) and DeepSeek-V3 (AUC = 0.723). However, DeLong's test indicated that the differences in AUC among the models were not statistically significant (p > 0.05). In addition, in subjective evaluations by physicians, DeepSeek-R1 received the highest scores for disease interpretation (4.73 ± 0.46) and treatment recommendations (4.70 ± 0.51), with consistent ratings.
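The AUC comparison described above can be illustrated with a minimal sketch of DeLong's test for two models scored on the same cases. This is not the authors' code, and the abstract does not state which software was used. The function names (`placements`, `auc`, `delong_test`) and the toy scores are hypothetical and purely illustrative; they are not study data. Labels are assumed to be 1 for malignant and 0 for benign, with higher scores indicating malignancy.

```python
import math


def _psi(pos_score, neg_score):
    """Pairwise kernel: 1 if the positive case outranks the negative, 0.5 on a tie."""
    if pos_score > neg_score:
        return 1.0
    if pos_score == neg_score:
        return 0.5
    return 0.0


def placements(scores, labels):
    """Return the per-positive (V10) and per-negative (V01) placement values."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    v10 = [sum(_psi(p, q) for q in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, q) for p in pos) / len(pos) for q in neg]
    return v10, v01


def auc(scores, labels):
    """AUC as the mean placement value (equivalent to the Mann-Whitney statistic)."""
    v10, _ = placements(scores, labels)
    return sum(v10) / len(v10)


def _cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(scores_a, scores_b, labels):
    """Two-sided DeLong test for the difference of two correlated AUCs.

    Returns (auc_a - auc_b, p_value). The p-value is NaN when the
    estimated variance of the difference is zero.
    """
    v10a, v01a = placements(scores_a, labels)
    v10b, v01b = placements(scores_b, labels)
    m, n = len(v10a), len(v01a)  # number of positive and negative cases
    var = (
        (_cov(v10a, v10a) + _cov(v10b, v10b) - 2 * _cov(v10a, v10b)) / m
        + (_cov(v01a, v01a) + _cov(v01b, v01b) - 2 * _cov(v01a, v01b)) / n
    )
    diff = sum(v10a) / m - sum(v10b) / m
    if var <= 0:
        return diff, float("nan")
    z = diff / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return diff, p_value


# Synthetic toy example (NOT study data): 3 malignant and 5 benign cases.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_model_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1, 0.35]
toy_model_b = [0.7, 0.3, 0.6, 0.5, 0.4, 0.2, 0.65, 0.1]

if __name__ == "__main__":
    difference, p = delong_test(toy_model_a, toy_model_b, toy_labels)
    print(f"AUC A = {auc(toy_model_a, toy_labels):.3f}")
    print(f"AUC B = {auc(toy_model_b, toy_labels):.3f}")
    print(f"difference = {difference:.3f}, p = {p:.3f}")
```

With few positive cases, as in a cohort with only 8 malignant tumors, the variance term tends to be large, which is one plausible reason an apparent gap in AUC may fail to reach statistical significance.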