Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models.

Authors

Li Kun-Peng, Wang Li, Wan Shun, Wang Chen-Yang, Chen Si-Yu, Liu Shan-Hui, Yang Li

Affiliations

Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China.

Institute of Urology, Clinical Research Center for Urology in Gansu Province, Lanzhou, China.

Publication information

J Endourol. 2025 May;39(5):494-499. doi: 10.1089/end.2024.0860. Epub 2025 Mar 18.

DOI: 10.1089/end.2024.0860
PMID: 40099418

Abstract

With the rapid advancement of artificial intelligence in health care, large language models (LLMs) demonstrate increasing potential in medical applications. However, their performance in specialized oncology remains limited. This study evaluates the performance of multiple leading LLMs in addressing clinical inquiries related to bladder cancer (BLCA) and demonstrates how strategic optimization can overcome these limitations. We developed a comprehensive set of 100 clinical questions based on established guidelines. These questions encompassed epidemiology, diagnosis, treatment, prognosis, and follow-up aspects of BLCA management. Six LLMs (Claude-3.5-Sonnet, ChatGPT-4.0, Grok-beta, Gemini-1.5-Pro, Mistral-Large-2, and GPT-3.5-Turbo) were tested through three independent trials. The responses were validated against current clinical guidelines and expert consensus. We implemented a two-phase training optimization process specifically for GPT-3.5-Turbo to enhance its performance. In the initial evaluation, Claude-3.5-Sonnet demonstrated the highest accuracy (89.33% ± 1.53%), followed by ChatGPT-4 (85.67% ± 1.15%). Grok-beta achieved 84.33% ± 1.53% accuracy, whereas Gemini-1.5-Pro and Mistral-Large-2 showed similar performance (82.00% ± 1.00% and 81.00% ± 1.00%, respectively). GPT-3.5-Turbo demonstrated the lowest accuracy (74.33% ± 3.06%). After the first phase of training, GPT-3.5-Turbo's accuracy improved to 86.67% ± 1.89%. Following the second phase of optimization, the model achieved 100% accuracy. This study not only establishes the comparative performance of various LLMs in BLCA-related queries but also validates the potential for significant improvement through targeted training optimization. The successful enhancement of GPT-3.5-Turbo's performance suggests that strategic model refinement can overcome initial limitations and achieve optimal accuracy in specialized medical applications.
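The accuracy figures above are reported as a mean ± standard deviation over three independent trials of the 100-question set. As a minimal sketch of how such a summary could be computed, the snippet below converts per-trial correct-answer counts into percentages and reports their mean and sample standard deviation. The per-trial counts and the function names are hypothetical: the abstract does not give per-trial results or say which standard-deviation estimator the authors used. The counts were chosen only so that the output has the same form as the reported 89.33% ± 1.53%.

```python
from statistics import mean, stdev


def summarize_accuracy(correct_counts, n_questions=100):
    """Return (mean, sample SD) of per-trial accuracy, in percent.

    correct_counts: number of correctly answered questions in each trial.
    n_questions: size of the question set (100 in the study).
    """
    accuracies = [100.0 * c / n_questions for c in correct_counts]
    # stdev() is the sample standard deviation (n - 1 denominator);
    # using it here is an assumption, not something the abstract states.
    return mean(accuracies), stdev(accuracies)


def format_accuracy(correct_counts, n_questions=100):
    """Format the summary in the 'mean% ± SD%' style used in the abstract."""
    m, s = summarize_accuracy(correct_counts, n_questions)
    return f"{m:.2f}% ± {s:.2f}%"


# Hypothetical per-trial counts, for illustration only.
hypothetical_trials = [88, 89, 91]
print(format_accuracy(hypothetical_trials))  # 89.33% ± 1.53%
```

With only three trials the standard deviation is sensitive to the choice of estimator, so anyone reproducing such numbers would need to know whether the sample or the population formula was applied.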

Similar articles

1
Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models.
J Endourol. 2025 May;39(5):494-499. doi: 10.1089/end.2024.0860. Epub 2025 Mar 18.
2
Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media: Evaluation Study.
J Med Internet Res. 2025 Mar 4;27:e63631. doi: 10.2196/63631.
3
Evaluating the Accuracy and Reliability of Large Language Models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in Answering Item-Analyzed Multiple-Choice Questions on Blood Physiology.
Cureus. 2025 Apr 8;17(4):e81871. doi: 10.7759/cureus.81871. eCollection 2025 Apr.
4
Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition.
Diagn Interv Radiol. 2025 Mar 3;31(2):111-129. doi: 10.4274/dir.2024.242876. Epub 2024 Sep 9.
5
The performance of ChatGPT on orthopaedic in-service training exams: A comparative study of the GPT-3.5 turbo and GPT-4 models in orthopaedic education.
J Orthop. 2023 Nov 23;50:70-75. doi: 10.1016/j.jor.2023.11.056. eCollection 2024 Apr.
6
Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study.
Crit Care. 2025 Feb 10;29(1):72. doi: 10.1186/s13054-025-05302-0.
7
Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study.
J Med Internet Res. 2025 Feb 5;27:e63626. doi: 10.2196/63626.
8
Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.
Acad Radiol. 2025 Feb;32(2):888-898. doi: 10.1016/j.acra.2024.09.041. Epub 2024 Sep 30.
9
Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.
JMIR Med Educ. 2025 Apr 10;11:e67244. doi: 10.2196/67244.
10
Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study.
PLoS One. 2025 Jan 29;20(1):e0317423. doi: 10.1371/journal.pone.0317423. eCollection 2025.