ChatGPT compared to national guidelines for management of ovarian cancer: Did ChatGPT get it right? - A Memorial Sloan Kettering Cancer Center Team Ovary study.

Affiliations

Gynecology Service, Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY, USA.

Gynecology Service, Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY, USA; Department of Obstetrics and Gynecology, Weill Cornell Medical College, New York, NY, USA.

Publication information

Gynecol Oncol. 2024 Oct;189:75-79. doi: 10.1016/j.ygyno.2024.07.007. Epub 2024 Jul 22.

DOI:10.1016/j.ygyno.2024.07.007

PMID:39042956

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11402584/

Abstract

OBJECTIVES

We evaluated the performance of a chatbot compared to the National Comprehensive Cancer Network (NCCN) Guidelines for the management of ovarian cancer.

METHODS

Using NCCN Guidelines, we generated 10 questions and answers regarding management of ovarian cancer at a single point in time. Questions were thematically divided into risk factors, surgical management, medical management, and surveillance. We asked ChatGPT (GPT-4) to provide responses without prompting (unprompted GPT) and with prompt engineering (prompted GPT). Responses were blinded and evaluated for accuracy and completeness by 5 gynecologic oncologists. A score of 0 was defined as inaccurate, 1 as accurate and incomplete, and 2 as accurate and complete. Evaluations were compared among NCCN, unprompted GPT, and prompted GPT answers.
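The scoring scheme described above can be tallied in a few lines. The following is a minimal sketch, not the authors' analysis: the function name `score_distribution` and the sample ratings are hypothetical, and it assumes every rating is exactly one of 0, 1, or 2.

```python
from collections import Counter

# Score labels as defined in the Methods.
LABELS = {
    0: "inaccurate",
    1: "accurate and incomplete",
    2: "accurate and complete",
}


def score_distribution(ratings):
    """Return the percentage of ratings that fall in each score category."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected scores: {sorted(unknown)}")
    total = len(ratings)
    # Counter returns 0 for scores that never occur, so every label is present.
    return {LABELS[score]: 100 * counts[score] / total for score in LABELS}


# Hypothetical ratings for one answer source (illustrative only, not study data).
example_ratings = [2, 2, 1, 0, 2, 1, 2, 2, 1, 2]
example = score_distribution(example_ratings)
```

Calling the function once per answer source (NCCN, unprompted GPT, prompted GPT) would give the kind of per-source percentages that the Results report.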

RESULTS

Overall, 48% of responses from NCCN, 64% from unprompted GPT, and 66% from prompted GPT were accurate and complete. The percentage of accurate but incomplete responses was higher for NCCN vs GPT-4. The percentage of accurate and complete scores for questions regarding risk factors, surgical management, and surveillance was higher for GPT-4 vs NCCN; however, for questions regarding medical management, the percentage was lower for GPT-4 vs NCCN. Overall, 14% of responses from unprompted GPT, 12% from prompted GPT, and 10% from NCCN were inaccurate.
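The share of accurate but incomplete responses is not given as a number, but it can be derived from the reported figures if each response received exactly one of the three scores. That is an assumption; the abstract does not state it explicitly. A short sketch of the arithmetic, with hypothetical variable names:

```python
# Percentages reported in the Results: (accurate and complete, inaccurate).
reported = {
    "NCCN": (48, 10),
    "unprompted GPT": (64, 14),
    "prompted GPT": (66, 12),
}

# Assumption: the three score categories are exhaustive and mutually exclusive,
# so whatever remains is the accurate-but-incomplete share.
incomplete = {
    source: 100 - complete - inaccurate
    for source, (complete, inaccurate) in reported.items()
}

for source, pct in incomplete.items():
    print(f"{source}: {pct}% accurate but incomplete")
```

Under that assumption the remainder is 42% for NCCN and 22% for each GPT-4 condition, which agrees with the statement that the incomplete share was higher for NCCN than for GPT-4.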

CONCLUSIONS

GPT-4 provided accurate and complete responses at a single point in time to a limited set of questions regarding ovarian cancer, with best performance in areas of risk factors, surgical management, and surveillance. Occasional inaccuracies, however, should limit unsupervised use of chatbots at this time.
