Chou Hung-Hsueh, Chen Yi Hua, Lin Chiu-Tzu, Chang Hsien-Tsung, Wu An-Chieh, Tsai Jia-Ling, Chen Hsiao-Wei, Hsu Ching-Chun, Liu Shu-Ya, Lee Jian Tao
Department of Obstetrics and Gynecology, Linkou Branch, Chang Gung Memorial Hospital, Tao-Yuan, Taiwan.
School of Medicine, National Tsing Hua University, Hsinchu, Taiwan.
Support Care Cancer. 2025 Apr 1;33(4):337. doi: 10.1007/s00520-025-09389-7.
Artificial intelligence (AI) chatbots such as ChatGPT-4 allow users to ask questions interactively. This study evaluated the correctness and completeness of responses to questions about ovarian cancer from a GPT-4-based chatbot, LilyBot, compared with responses from healthcare professionals in gynecologic cancer care.
Fifteen categories of questions about ovarian cancer were collected from an online patient chat-group forum. Ten healthcare professionals in gynecologic oncology generated 150 questions and responses related to these topics. Responses from LilyBot and from the healthcare professionals were scored for correctness and completeness by eight independent healthcare professionals with similar backgrounds who were blinded to the identity of the responders. Differences between groups were analyzed with Mann-Whitney U and Kruskal-Wallis tests, followed by Tukey's post hoc comparisons.
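For readers who want to see the shape of this analysis, below is a minimal Python sketch of the tests the abstract names, assuming SciPy. All scores are randomly generated placeholders rather than the study's data, and the variable names (lilybot, clinicians, the category arrays) are illustrative assumptions, not the authors' pipeline.

```python
# Illustrative sketch of the statistical comparisons named in the
# abstract; all scores below are synthetic placeholders, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical correctness scores (1-6 scale) for 150 questions per group
lilybot = rng.integers(4, 7, size=150)      # placeholder chatbot scores
clinicians = rng.integers(3, 7, size=150)   # placeholder clinician scores

# Two-group comparison: Mann-Whitney U test (nonparametric, unpaired)
u, p_u = stats.mannwhitneyu(lilybot, clinicians, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p_u:.4f}")

# Comparison across several question categories: Kruskal-Wallis H test
cat_a = rng.integers(4, 7, size=10)  # e.g., immunotherapy questions
cat_b = rng.integers(3, 7, size=10)  # e.g., gene therapy questions
cat_c = rng.integers(2, 7, size=10)  # e.g., screening questions
h, p_h = stats.kruskal(cat_a, cat_b, cat_c)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p_h:.4f}")

# Pairwise post hoc comparisons with Tukey's HSD
# (scipy.stats.tukey_hsd is available in recent SciPy releases)
print(stats.tukey_hsd(cat_a, cat_b, cat_c))
```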
Mean scores for overall performance across all 150 questions were significantly higher for LilyBot than for the healthcare professionals for both correctness (5.31 ± 0.98 vs. 5.07 ± 1.00, p = 0.017; scale 1-6) and completeness (2.66 ± 0.55 vs. 2.36 ± 0.55, p < 0.001; scale 1-3). LilyBot also scored significantly higher than the healthcare professionals on immunotherapy questions for correctness (6.00 ± 0.00 vs. 4.70 ± 0.48, p = 0.020) and completeness (3.00 ± 0.00 vs. 2.00 ± 0.00, p < 0.010), and on gene therapy questions for completeness (3.00 ± 0.00 vs. 2.20 ± 0.42, p = 0.023).
The significantly better performance of LilyBot compared with the healthcare professionals highlights the potential of ChatGPT-4-based dialogue systems to provide patients with clinical information about ovarian cancer.