Li Cheng-Peng, Jakob Jens, Menge Franka, Reißfelder Christoph, Hohenberger Peter, Yang Cui
Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Sarcoma Center, Peking University Cancer Hospital & Institute, Beijing, China.
Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany.
iScience. 2024 Nov 28;27(12):111493. doi: 10.1016/j.isci.2024.111493. eCollection 2024 Dec 20.
Clinical reliability assessment of large language models is necessary due to their increasing use in healthcare. This study assessed the performance of ChatGPT-3.5 and ChatGPT-4 in answering questions derived from the German evidence-based S3 guideline for adult soft tissue sarcoma (STS). Responses to 80 complex clinical questions covering diagnosis, treatment, and surveillance were independently scored by two sarcoma experts for accuracy and adequacy. ChatGPT-4 outperformed ChatGPT-3.5 overall, with higher median scores for both accuracy (5.5 vs. 5.0) and adequacy (5.0 vs. 4.0). While both versions performed similarly on questions about retroperitoneal/visceral sarcoma, gastrointestinal stromal tumor (GIST)-specific treatment, and surveillance, ChatGPT-4 performed better on questions about general STS treatment and extremity/trunk sarcomas. Despite their potential as a supportive tool, both models occasionally offered misleading and potentially life-threatening information. This underscores the importance of cautious adoption and human oversight in clinical settings.