
Large Language Models as Decision-Making Tools in Oncology: Comparing Artificial Intelligence Suggestions and Expert Recommendations.

Author information

Ah-Thiane Loic, Heudel Pierre-Etienne, Campone Mario, Robert Marie, Brillaud-Meflah Victoire, Rousseau Caroline, Le Blanc-Onfroy Magali, Tomaszewski Florine, Supiot Stéphane, Perennec Tanguy, Mervoyer Augustin, Frenel Jean-Sébastien

Affiliations

Department of Radiotherapy, ICO Rene Gauducheau, Saint-Herblain, France.

Department of Medical Oncology, Center Léon Bérard, Lyon, France.

Publication information

JCO Clin Cancer Inform. 2025 Mar;9:e2400230. doi: 10.1200/CCI-24-00230. Epub 2025 Mar 20.

Abstract

PURPOSE

To determine the accuracy of large language models (LLMs) in generating appropriate treatment options for patients with early breast cancer on the basis of their medical records.

MATERIALS AND METHODS

This retrospective study used anonymized medical records of patients with breast cancer (BC) presented at multidisciplinary team meetings (MDTs) between January and April 2024. Three generalist artificial intelligence models (Claude3-Opus, GPT4-Turbo, and LLaMa3-70B) were used to generate treatment suggestions, which were compared with the experts' decisions. The primary outcome was the rate of appropriate suggestions from the LLMs, with the experts' decisions as the reference. The secondary outcome was the LLMs' performance (F1 score and specificity) in generating appropriate suggestions for each treatment category.
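As an illustration of the secondary outcome, the following minimal sketch computes an F1 score and specificity for one treatment category from confusion-matrix counts, treating the experts' decision as the reference. The function name and the example counts are hypothetical and are not taken from the study; the abstract does not report per-category counts.

```python
def f1_and_specificity(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (F1 score, specificity) for one treatment category.

    tp: model suggests the treatment and the experts chose it
    fp: model suggests the treatment but the experts did not
    fn: model omits the treatment but the experts chose it
    tn: model omits the treatment and the experts did too
    """
    # F1 is the harmonic mean of precision and recall,
    # which simplifies to 2*TP / (2*TP + FP + FN).
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    # Specificity is the share of reference negatives the model also rejects.
    negatives = tn + fp
    specificity = tn / negatives if negatives else 0.0
    return f1, specificity


# Hypothetical counts for a single category (NOT data from the study).
f1, specificity = f1_and_specificity(tp=40, fp=10, fn=5, tn=57)
print(f"F1 = {f1:.3f}, specificity = {specificity:.3f}")
```

A low specificity in this framing would correspond to a model that often suggests a treatment the experts did not select.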

RESULTS

The rates of appropriate suggestions were 86.6% (97/112), 85.7% (96/112), and 75.0% (84/112) for Claude3-Opus, GPT4-Turbo, and LLaMa3-70B, respectively. No significant difference was found between Claude3-Opus and GPT4-Turbo (P = .85), but both tended to perform better than LLaMa3-70B (P = .027 and P = .043, respectively). LLMs showed high accuracy for adjuvant endocrine therapy and targeted therapy indications. However, they tended to overestimate the need for adjuvant radiotherapy and had variable performances in suggesting adjuvant chemotherapy and genomic tests.
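The reported percentages follow directly from the counts given above. A short sketch of that arithmetic is shown here; the variable and function names are illustrative. The statistical test behind the P values is not named in the abstract, so it is not reproduced.

```python
# Counts of appropriate suggestions out of 112, as reported in RESULTS.
appropriate_counts = {
    "Claude3-Opus": 97,
    "GPT4-Turbo": 96,
    "LLaMa3-70B": 84,
}
TOTAL_SUGGESTIONS = 112


def rate_percent(appropriate: int, total: int) -> float:
    """Rate of appropriate suggestions as a percentage, rounded to one decimal."""
    return round(100 * appropriate / total, 1)


rates = {
    model: rate_percent(count, TOTAL_SUGGESTIONS)
    for model, count in appropriate_counts.items()
}

for model, rate in rates.items():
    print(f"{model}: {rate}% ({appropriate_counts[model]}/{TOTAL_SUGGESTIONS})")
```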

CONCLUSION

LLMs, particularly Claude3-Opus and GPT4-Turbo, demonstrated promising accuracy in suggesting appropriate adjuvant treatments for patients with early BC on the basis of their medical records. Although LLMs showed limitations in validating surgery and indicating genomic tests, their performance in other treatment modalities highlights their potential to automate and augment decision making during MDTs. Further studies with fine-tuned LLMs and a prospective design are needed to demonstrate their utility in clinical practice.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3fc8/11949217/31a718fbf96b/cci-9-e2400230-g001.jpg
