Li Kun-Peng, Wang Li, Wan Shun, Wang Chen-Yang, Chen Si-Yu, Liu Shan-Hui, Yang Li
Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China.
Institute of Urology, Clinical Research Center for Urology in Gansu Province, Lanzhou, China.
J Endourol. 2025 May;39(5):494-499. doi: 10.1089/end.2024.0860. Epub 2025 Mar 18.
With the rapid advancement of artificial intelligence in health care, large language models (LLMs) demonstrate increasing potential in medical applications. However, their performance in specialized oncology remains limited. This study evaluates the performance of multiple leading LLMs in addressing clinical inquiries related to bladder cancer (BLCA) and demonstrates how strategic optimization can overcome these limitations. We developed a comprehensive set of 100 clinical questions based on established guidelines. These questions encompassed epidemiology, diagnosis, treatment, prognosis, and follow-up aspects of BLCA management. Six LLMs (Claude-3.5-Sonnet, ChatGPT-4, Grok-beta, Gemini-1.5-Pro, Mistral-Large-2, and GPT-3.5-Turbo) were tested through three independent trials. The responses were validated against current clinical guidelines and expert consensus. We implemented a two-phase training optimization process specifically for GPT-3.5-Turbo to enhance its performance. In the initial evaluation, Claude-3.5-Sonnet demonstrated the highest accuracy (89.33% ± 1.53%), followed by ChatGPT-4 (85.67% ± 1.15%). Grok-beta achieved 84.33% ± 1.53% accuracy, whereas Gemini-1.5-Pro and Mistral-Large-2 showed similar performance (82.00% ± 1.00% and 81.00% ± 1.00%, respectively). GPT-3.5-Turbo demonstrated the lowest accuracy (74.33% ± 3.06%). After the first phase of training, GPT-3.5-Turbo's accuracy improved to 86.67% ± 1.89%. Following the second phase of optimization, the model achieved 100% accuracy. This study not only establishes the comparative performance of various LLMs in BLCA-related queries but also validates the potential for significant improvement through targeted training optimization. The successful enhancement of GPT-3.5-Turbo's performance suggests that strategic model refinement can overcome initial limitations and achieve optimal accuracy in specialized medical applications.
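The ± values reported above are consistent with a sample standard deviation (n − 1 denominator) computed over the three independent trials. A minimal sketch of that calculation is shown below; the per-trial scores are hypothetical (the abstract reports only the summary statistics), chosen so that they reproduce the published mean ± SD for two of the models.

```python
from statistics import mean, stdev

# Hypothetical per-trial accuracies (%) — the abstract does not list
# individual trial scores, only mean ± SD. These values are chosen to
# reproduce the reported summaries for illustration.
trials = {
    "Claude-3.5-Sonnet": [88, 89, 91],        # reported: 89.33% ± 1.53%
    "GPT-3.5-Turbo (baseline)": [71, 75, 77], # reported: 74.33% ± 3.06%
}

for model, scores in trials.items():
    m = mean(scores)    # arithmetic mean over the 3 trials
    sd = stdev(scores)  # sample SD (n - 1), matching the ± values
    print(f"{model}: {m:.2f}% ± {sd:.2f}%")
```

Note that `statistics.stdev` uses the n − 1 (Bessel-corrected) denominator; the population SD (`statistics.pstdev`) would not reproduce the reported ± figures.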