Institute for Digital Medicine, Philipps-University Marburg, Marburg, Germany.
Department of Gynecology and Obstetrics, Philipps-University Marburg, Marburg, Germany.
Arch Gynecol Obstet. 2024 Jul;310(1):537-550. doi: 10.1007/s00404-024-07565-4. Epub 2024 May 29.
PURPOSE: This study investigated the concordance of five publicly available large language models (LLMs) with the treatment recommendations of a multidisciplinary tumor board for complex breast cancer patient profiles.
METHODS: Five LLMs, namely three versions of ChatGPT (GPT4, and GPT3.5 with data access until September 2021 and until January 2022), Llama2, and Bard, were prompted to produce treatment recommendations for 20 complex breast cancer patient profiles. The LLM recommendations were compared with those of a multidisciplinary tumor board (gold standard) across surgical, endocrine, and systemic treatment, radiotherapy, and genetic testing.
RESULTS: GPT4 demonstrated the highest concordance (70.6%) for invasive breast cancer patient profiles, followed by GPT3.5 September 2021 (58.8%), GPT3.5 January 2022 (41.2%), Llama2 (35.3%), and Bard (23.5%). When precancerous lesions of ductal carcinoma in situ were included, the ranking was identical but overall concordance was lower for each LLM (GPT4 60.0%, GPT3.5 September 2021 50.0%, GPT3.5 January 2022 35.0%, Llama2 30.0%, Bard 20.0%). GPT4 achieved full concordance (100%) for radiotherapy. Alignment was lowest for genetic testing recommendations, where concordance ranged from 55.0% (GPT3.5 January 2022, Llama2, and Bard) to 85.0% (GPT4).
CONCLUSION: This early feasibility study is the first to compare different LLMs in breast cancer care with regard to changes in accuracy over time, i.e., with access to more data or through technological upgrades. Methodological advancement, i.e., the optimization of prompting techniques, and technological development, i.e., enabling data input control and secure data processing, are necessary in the preparation of large-scale and multicenter studies to provide evidence on their safe and reliable clinical application. At present, safe and evidence-based use of LLMs in clinical breast cancer care is not yet feasible.
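The concordance figures reported above are percentages of patient profiles for which an LLM's recommendation matched the tumor board's. A minimal sketch of that calculation follows, assuming each recommendation can be reduced to a single comparable label per profile. The function name `concordance_rate` and the toy labels are hypothetical and are not taken from the study, which does not describe its scoring procedure at this level of detail.

```python
from typing import Sequence


def concordance_rate(llm: Sequence[str], board: Sequence[str]) -> float:
    """Percentage of profiles where the LLM label equals the board label."""
    if len(llm) != len(board):
        raise ValueError("both sequences must cover the same patient profiles")
    if not board:
        raise ValueError("at least one patient profile is required")
    # Count profile-by-profile agreement with the gold standard.
    matches = sum(a == b for a, b in zip(llm, board))
    return round(100.0 * matches / len(board), 1)


# Hypothetical toy data (assumption): "yes"/"no" for a genetic testing
# recommendation on five made-up profiles. Not data from the study.
board_recs = ["yes", "no", "yes", "yes", "no"]
llm_recs = ["yes", "no", "no", "yes", "yes"]
rate = concordance_rate(llm_recs, board_recs)
```

In this toy case the two lists agree on three of five profiles. Real treatment recommendations are multi-part and free-text, so reducing them to one label per profile is itself a simplifying assumption.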