Luo Peng-Wei, Liu Ji-Wen, Xie Xi, Jiang Jia-Wei, Huo Xin-Yu, Chen Zhen-Lin, Huang Zhang-Cheng, Jiang Shao-Qin, Li Meng-Qiang
Department of Urology, Fujian Union Hospital, Fujian Medical University, Fuzhou, Fujian, China.
Department of Urology, The First Affiliated Hospital of Chengdu Medical College, Chengdu, Sichuan, China.
Am J Clin Exp Urol. 2025 Apr 25;13(2):176-185. doi: 10.62347/UIAP7979. eCollection 2025.
Medical information generated by large language models (LLMs) is crucial for improving patient education and clinical decision-making. This study evaluates the performance of two LLMs (DeepSeek and ChatGPT) in answering questions related to prostate cancer radiotherapy in both Chinese- and English-language settings. Through a comparative analysis, we sought to determine which model provides higher-quality answers in each language environment.
A structured evaluation framework was developed using a set of clinically relevant questions covering three key domains: foundational knowledge, patient education, and treatment and follow-up care. Responses from DeepSeek and ChatGPT were generated in both English and Chinese and independently assessed by a panel of five oncology specialists using a five-point Likert scale. Statistical analyses, including the Wilcoxon signed-rank test, were performed to compare the models' performance across different linguistic contexts.
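To make the comparison procedure concrete, the minimal sketch below illustrates a Wilcoxon signed-rank test on paired Likert ratings, as described above; the scores are hypothetical placeholders, not data from this study.

```python
# A minimal sketch of the paired comparison, assuming hypothetical
# per-question mean Likert ratings (1-5) for the two models' answers
# to the same questions; these numbers are NOT the study's data.
from scipy.stats import wilcoxon

deepseek_scores = [5.0, 4.8, 5.0, 4.2, 5.0, 4.6, 5.0, 4.8, 5.0, 4.4]
chatgpt_scores  = [4.2, 4.8, 4.0, 4.2, 4.6, 4.0, 5.0, 4.4, 4.2, 4.0]

# The Wilcoxon signed-rank test is a nonparametric paired test suited to
# ordinal Likert data; zero_method="wilcox" drops tied pairs before ranking.
stat, p_value = wilcoxon(deepseek_scores, chatgpt_scores, zero_method="wilcox")
print(f"W = {stat:.1f}, P = {p_value:.3f}")
```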
This study ultimately included 33 questions for scoring. In Chinese, DeepSeek outperformed ChatGPT, achieving top ratings (score = 5) in 75.76% vs. 36.36% of responses (P < 0.001), and excelled in foundational knowledge (76.92% vs. 38.46%, P = 0.047) and treatment/follow-up (81.82% vs. 36.36%, P = 0.031). In English, ChatGPT showed comparable performance (66.7% vs. 54.55% top-rated responses, P = 0.236), with a marginal advantage in treatment/follow-up (63.64% vs. 54.55%, P = 0.563), while DeepSeek maintained strengths in foundational knowledge (69.23% vs. 30.77%, P = 0.047) and patient education (88.89% vs. 55.56%, P = 0.125). These findings underscore DeepSeek's superior Chinese-language proficiency and the impact of language-specific optimization.
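For illustration, the sketch below shows how per-domain top-rating percentages of the kind reported above can be tabulated; the domain labels and scores are hypothetical, not the study's data.

```python
# A hedged illustration of tabulating "top-rated" (score = 5) percentages
# by question domain for one model; all values here are placeholders.
from collections import Counter

ratings = [("foundational", 5), ("foundational", 4), ("patient_ed", 5),
           ("treatment", 5), ("treatment", 3), ("patient_ed", 5)]

totals = Counter(domain for domain, _ in ratings)
tops = Counter(domain for domain, score in ratings if score == 5)
for domain in totals:
    print(f"{domain}: {100 * tops[domain] / totals[domain]:.2f}% top-rated")
```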
This study shows that DeepSeek excels at providing medical information in Chinese, while the two models perform similarly in English. These findings underscore the importance of selecting language-appropriate artificial intelligence (AI) models to enhance the accuracy and reliability of medical AI applications. While both models show promise in supporting patient education and clinical decision-making, human expert review remains necessary to ensure response accuracy and minimize potential misinformation.