Li Cheng-Peng, Kalisa Aimé Terence, Roohani Siyer, Hummedah Kamal, Menge Franka, Reißfelder Christoph, Albertsmeier Markus, Kasper Bernd, Jakob Jens, Yang Cui
Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Sarcoma Center, Peking University Cancer Hospital & Institute, Beijing, China.
Department of Surgery, Mannheim School of Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany.
J Cancer Res Clin Oncol. 2025 Sep 10;151(9):248. doi: 10.1007/s00432-025-06304-9.
This study aimed to compare treatment recommendations generated by four leading large language models (LLMs) with those from the multidisciplinary tumor boards (MTBs) of 21 sarcoma centers participating in the sarcoma ring trial, for the management of complex soft tissue sarcoma (STS) cases.
We simulated STS-MTBs using four LLMs (Llama 3.2-Vision 90B, Claude 3.5 Sonnet, DeepSeek-R1, and OpenAI o1) across five anonymized STS cases from the sarcoma ring trial. Each model was queried 21 times per case with a standardized prompt, and the responses were compared with those of the human MTBs in terms of intra-model consistency, treatment recommendation alignment, alternative recommendations, and source citation.
LLMs demonstrated high inter-model and intra-model consistency in only 20% of cases, and their recommendations aligned with the human consensus in only 20-60% of cases. Claude 3.5 Sonnet, the model with the highest concordance with the most common MTB recommendation, aligned with the experts in only 60% of cases. Notably, the recommendations were highly heterogeneous across MTBs themselves, contextualizing the variable LLM performance. Discrepancies were particularly notable where common human recommendations were absent from LLM outputs. Additionally, the rationale for the LLMs' recommendations was clearly derived from the German S3 sarcoma guidelines in only 24.8% to 55.2% of responses. Potentially harmful suggestions were also occasionally observed among the LLMs' alternative recommendations.
Despite the considerable heterogeneity observed among MTB recommendations, the significant discrepancies and potentially harmful recommendations highlight the limitations of current AI tools, underscoring that referral to high-volume sarcoma centers remains essential for optimal patient care. At the same time, LLMs could serve as an excellent tool for preparing MTB discussions.