Large Language Models as Decision-Making Tools in Oncology: Comparing Artificial Intelligence Suggestions and Expert Recommendations.

Authors

Ah-Thiane Loic, Heudel Pierre-Etienne, Campone Mario, Robert Marie, Brillaud-Meflah Victoire, Rousseau Caroline, Le Blanc-Onfroy Magali, Tomaszewski Florine, Supiot Stéphane, Perennec Tanguy, Mervoyer Augustin, Frenel Jean-Sébastien

Affiliations

Department of Radiotherapy, ICO Rene Gauducheau, Saint-Herblain, France.

Department of Medical Oncology, Center Léon Bérard, Lyon, France.

Publication information

JCO Clin Cancer Inform. 2025 Mar;9:e2400230. doi: 10.1200/CCI-24-00230. Epub 2025 Mar 20.

DOI:10.1200/CCI-24-00230

PMID:40112233

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11949217/

Abstract

PURPOSE

To determine the accuracy of large language models (LLMs) in generating appropriate treatment options for patients with early breast cancer (BC) on the basis of their medical records.

MATERIALS AND METHODS

Retrospective study using anonymized medical records of patients with BC presented during multidisciplinary team meetings (MDTs) between January and April 2024. Three generalist artificial intelligence models (Claude3-Opus, GPT4-Turbo, and LLaMa3-70B) were used to generate treatment suggestions, which were compared with experts' decisions. The primary outcome was the rate of appropriate suggestions from the LLMs, compared with the reference experts' decisions. The secondary outcome was the LLMs' performances (F1 score and specificity) in generating appropriate suggestions for each treatment category.
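The secondary outcome metrics can be illustrated with a minimal sketch. Assuming each case carries a binary label per treatment category (expert decision versus LLM suggestion), the function below computes the F1 score and specificity for one category. The function name, the label encoding, and the example labels are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def f1_and_specificity(expert: Sequence[bool], llm: Sequence[bool]) -> Tuple[float, float]:
    """Return (F1, specificity) for one treatment category.

    expert[i] is True if the MDT experts recommended the treatment for
    patient i; llm[i] is True if the model suggested it.
    """
    if len(expert) != len(llm):
        raise ValueError("label sequences must have the same length")
    # Confusion-matrix counts, with the expert decision as the reference.
    tp = sum(e and m for e, m in zip(expert, llm))
    fp = sum((not e) and m for e, m in zip(expert, llm))
    fn = sum(e and (not m) for e, m in zip(expert, llm))
    tn = sum((not e) and (not m) for e, m in zip(expert, llm))
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # Specificity is the share of true negatives among reference negatives.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return f1, specificity


# Hypothetical labels for eight patients (illustration only, not study data).
expert_labels = [True, True, True, False, False, True, False, False]
llm_labels = [True, True, False, True, False, True, False, False]
f1, spec = f1_and_specificity(expert_labels, llm_labels)
print(f"F1 = {f1:.2f}, specificity = {spec:.2f}")
```

Under this encoding, a model that suggests a treatment the experts did not recommend adds a false positive, which lowers specificity for that category.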

RESULTS

The rates of appropriate suggestions were 86.6% (97/112), 85.7% (96/112), and 75.0% (84/112) for Claude3-Opus, GPT4-Turbo, and LLaMa3-70B, respectively. No significant difference was found between Claude3-Opus and GPT4-Turbo (P = .85), but both tended to perform better than LLaMa3-70B (P = .027 and P = .043, respectively). LLMs showed high accuracy for adjuvant endocrine therapy and targeted therapy indications. However, they tended to overestimate the need for adjuvant radiotherapy and had variable performances in suggesting adjuvant chemotherapy and genomic tests.
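As a quick arithmetic check, the reported percentages can be reproduced from the counts given in the results. The snippet below is only a sketch: the variable names are illustrative, and the P values are not recomputed because the abstract does not name the statistical test used.

```python
# Counts of appropriate suggestions out of 112 cases, as reported in the abstract.
appropriate = {"Claude3-Opus": 97, "GPT4-Turbo": 96, "LLaMa3-70B": 84}
total_cases = 112

# Rate of appropriate suggestions per model, expressed as a percentage.
rates = {model: 100 * n / total_cases for model, n in appropriate.items()}
for model, rate in rates.items():
    print(f"{model}: {rate:.1f}%")
```

Formatted to one decimal place, the three rates come out as 86.6%, 85.7%, and 75.0%, matching the figures in the abstract.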

CONCLUSION

LLMs, particularly Claude3-Opus and GPT4-Turbo, demonstrated promising accuracy in suggesting appropriate adjuvant treatments for patients with early BC on the basis of their medical records. Although LLMs showed limitations in validating surgery and indicating genomic tests, their performance in other treatment modalities highlights their potential to automate and augment decision making during MDTs. Further studies with fine-tuned LLMs and a prospective design are needed to demonstrate their utility in clinical practice.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3fc8/11949217/31a718fbf96b/cci-9-e2400230-g001.jpg
