Gungor Nur Dokuzeylul, Esen Fatih Sinan, Tasci Tolga, Gungor Kagan, Cil Kaan
Department of Reproductive Endocrinology and IVF Center BAU, Goztepe Medical Park Hospital, Istanbul, Turkey.
Department of Computer Engineering, Ankara University, Ankara, Turkey.
Oncol Res Treat. 2025;48(3):102-111. doi: 10.1159/000543173. Epub 2024 Dec 17.
This study evaluates the performance of three versions of the ChatGPT large language model (ChatGPT-3.5, ChatGPT-4, and ChatGPT-Omni) in answering questions on the diagnosis and treatment of gynecological cancers, including ovarian, endometrial, and cervical cancers.
A total of 804 questions were distributed equally across four categories (201 per category): true/false, multiple-choice, open-ended, and case-scenario, with each category spanning several levels of complexity. Performance was assessed on a six-point Likert scale for accuracy, completeness, and alignment with established clinical guidelines.
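For orientation, the sketch below shows how such a question bank might be organized in code; it is a minimal illustration, not the authors' actual setup. The 201-per-category count follows from the stated equal split (804 / 4), while the difficulty labels and field names are assumptions inferred from the results.

```python
from dataclasses import dataclass

# Hypothetical layout of the question bank described in the abstract.
# The equal four-way split implies 804 / 4 = 201 questions per category;
# difficulty labels mirror the easy/medium/complex levels in the results.
CATEGORIES = ("true_false", "multiple_choice", "open_ended", "case_scenario")
DIFFICULTIES = ("easy", "medium", "complex")

@dataclass
class Question:
    category: str                    # one of CATEGORIES
    difficulty: str                  # one of DIFFICULTIES
    text: str
    likert_score: int | None = None  # 1-6 rating of a model's answer

TOTAL_QUESTIONS = 804
PER_CATEGORY = TOTAL_QUESTIONS // len(CATEGORIES)  # 201
```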
For true/false questions, ChatGPT-Omni achieved accuracy rates of 100% for easy, 98% for medium, and 97% for complex items, higher than ChatGPT-4 (94%, 90%, 85%) and ChatGPT-3.5 (90%, 85%, 80%) (p = 0.041, 0.023, and 0.014, respectively). For multiple-choice questions, ChatGPT-Omni again led with 100% for easy, 98% for medium, and 93% for complex items, compared with ChatGPT-4 (92%, 88%, 80%) and ChatGPT-3.5 (85%, 80%, 70%) (p = 0.035, 0.028, 0.011). For open-ended questions, ChatGPT-Omni earned mean Likert scores of 5.8 for easy, 5.5 for medium, and 5.2 for complex items, outperforming ChatGPT-4 (5.4, 5.0, 4.5) and ChatGPT-3.5 (5.0, 4.5, 4.0) (p = 0.037, 0.026, 0.015). Case-scenario questions showed the same pattern, with ChatGPT-Omni scoring 5.6, 5.3, and 4.9 for easy, medium, and complex items, respectively (p = 0.017, 0.008, 0.012).
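The abstract does not name the statistical tests behind these p-values. Purely as an illustrative sketch, accuracy rates could be compared between two models with a chi-square test on correct/incorrect counts, and Likert ratings with a Kruskal-Wallis test; the per-level sample size of 67 below is an assumption (201 questions per category split evenly over three difficulty levels), and the Likert ratings are random placeholders, not study data.

```python
import numpy as np
from scipy.stats import chi2_contingency, kruskal

N_PER_LEVEL = 67  # assumed: 201 questions per category / 3 difficulty levels

def accuracy_p_value(acc_a: float, acc_b: float, n: int = N_PER_LEVEL) -> float:
    """Chi-square test on correct/incorrect counts for two models."""
    correct_a, correct_b = round(acc_a * n), round(acc_b * n)
    table = np.array([[correct_a, n - correct_a],
                      [correct_b, n - correct_b]])
    _, p, _, _ = chi2_contingency(table)
    return p

# e.g. complex true/false items: ChatGPT-Omni 97% vs ChatGPT-3.5 80%
print(accuracy_p_value(0.97, 0.80))

# Likert-scored items (open-ended, case-scenario): compare per-question
# ratings between two models with a rank-based test. Placeholder data only.
rng = np.random.default_rng(0)
omni_ratings = rng.integers(4, 7, N_PER_LEVEL)  # values on the 1-6 scale
gpt4_ratings = rng.integers(3, 6, N_PER_LEVEL)
stat, p = kruskal(omni_ratings, gpt4_ratings)
print(p)
```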
ChatGPT-Omni exhibited superior performance in responding to clinical queries related to gynecological cancers, underscoring its potential utility as a decision support tool and an educational resource in clinical practice.