Department of Nursing, Jinzhou Medical University, Jinzhou.
Department of Clinical Trials.
Int J Surg. 2024 Apr 1;110(4):1941-1950. doi: 10.1097/JS9.0000000000001066.
Large language models (LLMs) have garnered significant attention in the AI domain owing to their exemplary context recognition and response capabilities. However, the potential of LLMs in specific clinical scenarios, particularly in breast cancer diagnosis, treatment, and care, has not been fully explored. This study aimed to compare the performances of three major LLMs in the clinical context of breast cancer.
In this study, clinical scenarios designed specifically for breast cancer were grouped into five pivotal domains (nine cases): assessment and diagnosis, treatment decision-making, postoperative care, psychosocial support, and prognosis and rehabilitation. The LLMs were used to generate feedback on queries drawn from these domains. For each scenario, a panel of five breast cancer specialists, each with over a decade of experience, evaluated the LLMs' feedback in terms of quality, relevance, and applicability.
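The abstract does not describe the prompting pipeline itself, so the following is a minimal, hypothetical sketch of how responses for one scenario might be collected from the three models using the official OpenAI and Anthropic Python SDKs. The model identifiers, the example prompt, and the helper functions are illustrative assumptions, not the authors' actual protocol.

```python
# Hypothetical sketch: collect feedback from GPT-4.0, GPT-3.5, and Claude2
# for one breast cancer clinical scenario. Model names, the prompt text, and
# the helpers are assumptions for illustration, not the study's pipeline.
from openai import OpenAI               # pip install openai
import anthropic                        # pip install anthropic

openai_client = OpenAI()                # reads OPENAI_API_KEY from the environment
claude_client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY

SCENARIO = (
    "A 52-year-old woman with newly diagnosed stage II invasive ductal "
    "carcinoma asks about her treatment options."   # illustrative case text only
)

def ask_gpt(model: str, prompt: str) -> str:
    """Query an OpenAI chat model and return the text of its reply."""
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_claude(model: str, prompt: str) -> str:
    """Query an Anthropic model and return the text of its reply."""
    resp = claude_client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    answers = {
        "GPT-4.0": ask_gpt("gpt-4", SCENARIO),
        "GPT-3.5": ask_gpt("gpt-3.5-turbo", SCENARIO),
        "Claude2": ask_claude("claude-2.1", SCENARIO),
    }
    for model_name, text in answers.items():
        print(f"--- {model_name} ({len(text.split())} words) ---\n{text}\n")
```

Responses gathered this way could then be anonymized and passed to the specialist panel for scoring on quality, relevance, and applicability.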
There was a moderate level of agreement among the raters (Fleiss' kappa=0.345, P<0.05). In terms of response length, GPT-4.0 and GPT-3.5 provided longer feedback than Claude2. Furthermore, across the nine case analyses, GPT-4.0 significantly outperformed the other two models in average quality, relevance, and applicability. Within the five clinical areas, GPT-4.0 markedly surpassed GPT-3.5 in quality in the four areas other than treatment decision-making and scored higher than Claude2 in tasks related to psychosocial support and treatment decision-making.
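For reference, an inter-rater agreement statistic of this kind can be computed from the specialists' ratings with statsmodels. The sketch below uses fabricated ratings on a 1-5 scale from five raters purely to show the calculation; it does not reproduce the study's data.

```python
# Illustrative computation of Fleiss' kappa for five raters scoring items
# on a 1-5 scale. The ratings below are fabricated placeholders, not the
# study's data; only the calculation itself is demonstrated.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = rated items (e.g., one model response per row), columns = the 5 raters
ratings = np.array([
    [4, 4, 5, 4, 3],
    [3, 3, 4, 3, 3],
    [5, 4, 5, 5, 4],
    [2, 3, 3, 2, 3],
    [4, 5, 4, 4, 4],
])

# aggregate_raters converts rater-per-column data into the items x categories
# count table that fleiss_kappa expects as input.
table, categories = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")
```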
This study revealed that, in clinical applications for breast cancer, GPT-4.0 demonstrates not only superior quality and relevance but also exceptional applicability, especially when compared with GPT-3.5. Relative to Claude2, GPT-4.0 holds advantages in specific domains. As the use of LLMs in the clinical field expands, ongoing optimization and rigorous accuracy assessments remain paramount.