Delourme Solène, Redjdal Akram, Bouaud Jacques, Seroussi Brigitte
Sorbonne Université, Université Sorbonne Paris Nord, INSERM, LIMICS, Paris, France.
EPITA, Paris, France.
Methods Inf Med. 2024 Sep;63(3-04):85-96. doi: 10.1055/a-2528-4299. Epub 2025 Jan 29.
Multidisciplinary tumor boards (MTBs) have been established in most countries to allow experts to collaboratively determine the best treatment decisions for cancer patients. However, MTBs often face challenges such as case overload, which can compromise MTB decision quality. Clinical decision support systems (CDSSs) have been introduced to assist clinicians in this process. Despite their potential, CDSSs are still underutilized in routine practice. The emergence of large language models (LLMs), such as ChatGPT, offers new opportunities to improve the efficiency and usability of traditional CDSSs.
OncoDoc2 is a guideline-based CDSS developed using a documentary approach and applied to breast cancer management. This study aims to evaluate the potential of LLMs, used as question-answering (QA) systems, to improve the usability of OncoDoc2 across different prompt engineering techniques (PETs).
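The abstract does not give the exact prompt templates used; as a purely illustrative sketch, a zero-shot prompt in this setting might combine a BCPS with one OncoDoc2 criterion question as follows (the function name, instruction wording, and patient data are all hypothetical assumptions, not taken from the study):

```python
# Hypothetical sketch of zero-shot prompt assembly for an LLM used as a
# QA system over OncoDoc2 questions; templates here are assumed, not
# the ones actually used in the study.
def build_prompt(bcps_text: str, question: str) -> str:
    """Combine a breast cancer patient summary (BCPS) with one
    decision-tree criterion question from OncoDoc2."""
    return (
        "You are assisting a breast cancer multidisciplinary tumor board.\n"
        f"Patient summary:\n{bcps_text}\n\n"
        "Answer the following question with 'yes' or 'no' only:\n"
        f"{question}"
    )

prompt = build_prompt(
    "65-year-old woman, 2 cm invasive ductal carcinoma, ER+, HER2-.",
    "Is the tumor hormone receptor positive?",
)
```

Constraining the output to "yes"/"no" keeps the LLM's answer directly mappable to a node of the OncoDoc2 decision tree.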
Data extracted from breast cancer patient summaries (BCPSs), together with questions formulated by OncoDoc2, were used to create prompts for various LLMs, and several PETs were designed and tested. Using a sample of 200 randomized BCPSs, LLMs and PETs were initially compared with regard to their responses to OncoDoc2 questions using classic metrics (accuracy, precision, recall, and F1 score). Best performing LLMs and PETs were further assessed by comparing the therapeutic recommendations generated by OncoDoc2, based on LLM inputs, to those provided by MTB clinicians using OncoDoc2. Finally, the best performing method was validated using a new sample of 30 randomized BCPSs.
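The classic metrics used for the comparison can be computed directly from paired answers. The sketch below assumes (hypothetically) that each OncoDoc2 question yields a binary yes/no answer, with the clinician's answer as the reference; the sample data is invented for illustration:

```python
# Illustrative computation of accuracy, precision, recall, and F1 for
# binary yes/no answers; the data here is invented, not from the study.
llm_answers       = ["yes", "no", "yes", "yes", "no", "no"]
clinician_answers = ["yes", "no", "no",  "yes", "yes", "no"]

pairs = list(zip(llm_answers, clinician_answers))
tp = sum(l == "yes" and c == "yes" for l, c in pairs)  # true positives
fp = sum(l == "yes" and c == "no"  for l, c in pairs)  # false positives
fn = sum(l == "no"  and c == "yes" for l, c in pairs)  # false negatives
tn = sum(l == "no"  and c == "no"  for l, c in pairs)  # true negatives

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / len(pairs)
```

Note that precision, recall, and F1 focus on the positive ("yes") class, while accuracy counts all matching answers, which is why the reported accuracy (75.57%) can sit well above the F1 score (56.59%).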
The combination of Mistral and OpenChat models under the enhanced Zero-Shot PET showed the best performance as a question-answering system. On the validation set of 30 BCPSs, this approach achieved a precision of 60.16%, a recall of 54.18%, an F1 score of 56.59%, and an accuracy of 75.57%. However, it yielded poor results as a CDSS, with only 16.67% of the recommendations generated by OncoDoc2 based on LLM inputs matching the gold standard.
All the criteria in the OncoDoc2 decision tree are crucial for capturing the uniqueness of each patient, and any deviation from a criterion alters the recommendations generated. Despite achieving a good accuracy rate of 75.57%, LLMs still face challenges in reliably understanding complex medical contexts and in serving effectively as CDSSs.