Department of Nursing, Jinzhou Medical University, Jinzhou.
Department of Clinical Trials.
Int J Surg. 2024 Apr 1;110(4):1941-1950. doi: 10.1097/JS9.0000000000001066.
Large language models (LLMs) have garnered significant attention in the AI domain owing to their exemplary context recognition and response capabilities. However, the potential of LLMs in specific clinical scenarios, particularly in breast cancer diagnosis, treatment, and care, has not been fully explored. This study aimed to compare the performances of three major LLMs in the clinical context of breast cancer.
In this study, clinical scenarios designed specifically for breast cancer were grouped into five pivotal domains (nine cases): assessment and diagnosis, treatment decision-making, postoperative care, psychosocial support, and prognosis and rehabilitation. The LLMs were used to generate feedback on queries drawn from these domains. For each scenario, a panel of five breast cancer specialists, each with over a decade of experience, evaluated the LLMs' feedback in terms of quality, relevance, and applicability.
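The abstract does not describe the prompting pipeline itself, so the following is a minimal, hypothetical sketch of how responses for one scenario might be collected from the three models using the official OpenAI and Anthropic Python SDKs. The model identifiers, the example prompt, and the helper functions are illustrative assumptions, not the authors' actual protocol.

```python
# Hypothetical sketch: collect feedback from GPT-4.0, GPT-3.5, and Claude2
# for one breast cancer clinical scenario. Model names, the prompt text, and
# the helpers are assumptions for illustration, not the study's pipeline.
from openai import OpenAI               # pip install openai
import anthropic                        # pip install anthropic

openai_client = OpenAI()                # reads OPENAI_API_KEY from the environment
claude_client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY

SCENARIO = (
    "A 52-year-old woman with newly diagnosed stage II invasive ductal "
    "carcinoma asks about her treatment options."   # illustrative case text only
)

def ask_gpt(model: str, prompt: str) -> str:
    """Query an OpenAI chat model and return the text of its reply."""
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_claude(model: str, prompt: str) -> str:
    """Query an Anthropic model and return the text of its reply."""
    resp = claude_client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    answers = {
        "GPT-4.0": ask_gpt("gpt-4", SCENARIO),
        "GPT-3.5": ask_gpt("gpt-3.5-turbo", SCENARIO),
        "Claude2": ask_claude("claude-2.1", SCENARIO),
    }
    for model_name, text in answers.items():
        print(f"--- {model_name} ({len(text.split())} words) ---\n{text}\n")
```

Responses gathered this way could then be anonymized and passed to the specialist panel for scoring on quality, relevance, and applicability.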
There was a moderate level of agreement among the raters (Fleiss' kappa=0.345, P<0.05). In terms of response length, GPT-4.0 and GPT-3.5 provided longer feedback than Claude2. Furthermore, across the nine case analyses, GPT-4.0 significantly outperformed the other two models in average quality, relevance, and applicability. Within the five clinical areas, GPT-4.0 markedly surpassed GPT-3.5 in quality in the four areas other than treatment decision-making and scored higher than Claude2 in tasks related to psychosocial support and treatment decision-making.
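For reference, an inter-rater agreement statistic of this kind can be computed from the specialists' ratings with statsmodels. The sketch below uses fabricated ratings on a 1-5 scale from five raters purely to show the calculation; it does not reproduce the study's data.

```python
# Illustrative computation of Fleiss' kappa for five raters scoring items
# on a 1-5 scale. The ratings below are fabricated placeholders, not the
# study's data; only the calculation itself is demonstrated.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = rated items (e.g., one model response per row), columns = the 5 raters
ratings = np.array([
    [4, 4, 5, 4, 3],
    [3, 3, 4, 3, 3],
    [5, 4, 5, 5, 4],
    [2, 3, 3, 2, 3],
    [4, 5, 4, 4, 4],
])

# aggregate_raters converts rater-per-column data into the items x categories
# count table that fleiss_kappa expects as input.
table, categories = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")
```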
This study revealed that, in clinical applications for breast cancer, GPT-4.0 demonstrates not only superior quality and relevance but also exceptional applicability, especially when compared with GPT-3.5. Relative to Claude2, GPT-4.0 holds advantages in specific domains. As the use of LLMs in the clinical field expands, ongoing optimization and rigorous accuracy assessments remain paramount.