Li Cheng-Peng, Jakob Jens, Menge Franka, Reißfelder Christoph, Hohenberger Peter, Yang Cui
Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Sarcoma Center, Peking University Cancer Hospital & Institute, Beijing, China.
Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany.
iScience. 2024 Nov 28;27(12):111493. doi: 10.1016/j.isci.2024.111493. eCollection 2024 Dec 20.
Clinical reliability assessment of large language models is necessary due to their increasing use in healthcare. This study assessed the performance of ChatGPT-3.5 and ChatGPT-4 in answering questions derived from the German evidence-based S3 guideline for adult soft tissue sarcoma (STS). Responses to 80 complex clinical questions covering diagnosis, treatment, and surveillance were independently scored by two sarcoma experts for accuracy and adequacy. ChatGPT-4 outperformed ChatGPT-3.5 overall, with higher median scores for both accuracy (5.5 vs. 5.0) and adequacy (5.0 vs. 4.0). While both versions performed similarly on questions about retroperitoneal/visceral sarcoma, gastrointestinal stromal tumor (GIST)-specific treatment, and surveillance, ChatGPT-4 performed better on questions about general STS treatment and extremity/trunk sarcomas. Despite their potential as a supportive tool, both models occasionally offered misleading and potentially life-threatening information. This underscores the importance of cautious adoption and human oversight in clinical settings.