Institute for Digital Medicine, University Hospital Giessen and Marburg, Philipps-University Marburg, Marburg, Germany.
Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Palo Alto, CA, USA.
J Cancer Res Clin Oncol. 2024 Oct 9;150(10):451. doi: 10.1007/s00432-024-05964-3.
Large language models (LLMs) show potential for decision support in breast cancer care. However, their use in clinical care is currently precluded by a lack of control over the sources used for decision-making, limited explainability of the decision-making process, and health data security concerns. Small language models (SLMs), a recent development, have been discussed as a way to address these challenges. This preclinical proof-of-concept study tailors an open-source SLM to the German breast cancer guideline (BC-SLM) and evaluates its initial clinical accuracy and technical functionality in a preclinical simulation.
A multidisciplinary tumor board (MTB) serves as the gold standard for assessing initial clinical accuracy, measured as the concordance of the BC-SLM with the MTB and compared against two publicly available LLMs, ChatGPT3.5 and ChatGPT4. The study includes 20 fictional patient profiles and recommendations for 5 treatment modalities, yielding 100 binary treatment recommendations (recommended or not recommended). Statistical evaluation reports concordance with the MTB as a percentage together with Cohen's kappa statistic (κ). Technical functionality is assessed qualitatively in terms of local hosting, guideline adherence, and information retrieval.
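The two concordance metrics above can be sketched in a few lines of Python. The data here are illustrative placeholders, not the study's actual recommendations; the study compared 100 binary recommendations per model against the MTB.

```python
def cohens_kappa(gold, pred):
    """Cohen's kappa for two binary raters (1 = recommended, 0 = not recommended)."""
    n = len(gold)
    # Observed agreement: fraction of cases where both raters agree
    po = sum(g == p for g, p in zip(gold, pred)) / n
    # Expected chance agreement from each rater's marginal rates
    p_gold = sum(gold) / n
    p_pred = sum(pred) / n
    pe = p_gold * p_pred + (1 - p_gold) * (1 - p_pred)
    return (po - pe) / (1 - pe)

# Illustrative example: 10 binary recommendations (MTB vs. a model)
mtb   = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
model = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]

concordance = sum(g == p for g, p in zip(mtb, model)) / len(mtb)
print(f"concordance = {concordance:.0%}, kappa = {cohens_kappa(mtb, model):.3f}")
# → concordance = 80%, kappa = 0.583
```

A kappa of 1 indicates perfect agreement beyond chance, 0 indicates chance-level agreement; the study's values (e.g. κ = 0.721 for BC-SLM) fall in the substantial-agreement range.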
The overall concordance amounts to 86% for BC-SLM (κ = 0.721, p < 0.001), 90% for ChatGPT4 (κ = 0.820, p < 0.001), and 83% for ChatGPT3.5 (κ = 0.661, p < 0.001). Concordance for individual treatment modalities ranges from 65-100% for BC-SLM, 85-100% for ChatGPT4, and 55-95% for ChatGPT3.5. The BC-SLM runs locally, adheres to the German breast cancer guideline, and cites the guideline sections underlying its decision-making.
The tailored BC-SLM shows initial clinical accuracy and technical functionality, with concordance to the MTB comparable to that of publicly available LLMs such as ChatGPT4 and ChatGPT3.5. This serves as a proof of concept for adapting an SLM to an oncological disease and its guideline, addressing prevailing issues with LLMs by ensuring decision transparency, explainability, source control, and data security. It represents a necessary step towards clinical validation and the safe use of language models in clinical oncology.