Maida Marcello, Ramai Daryl, Mori Yuichi, Dinis-Ribeiro Mário, Facciorusso Antonio, Hassan Cesare
Department of Medicine and Surgery, University of Enna "Kore," Enna, Italy.
Division of Gastroenterology and Hepatology, University of Utah Health, Salt Lake City, Utah, USA.
Endoscopy. 2025 Mar;57(3):262-268. doi: 10.1055/a-2388-6084. Epub 2024 Aug 14.
This study aimed to evaluate the effectiveness of ChatGPT (Chat Generative Pretrained Transformer) in answering patients' questions about colorectal cancer (CRC) screening, with the ultimate goal of enhancing patients' awareness and adherence to national screening programs.
Fifteen questions on CRC screening were posed to ChatGPT-4. The answers were rated by 20 gastroenterology experts and 20 nonexperts across three domains (accuracy, completeness, and comprehensibility), and by 100 patients across three dichotomous domains (completeness, comprehensibility, and trustworthiness).
According to the expert ratings, the mean (SD) accuracy score was 4.8 (1.1) on a scale ranging from 1 to 6. The mean (SD) scores for completeness and comprehensibility were 2.1 (0.7) and 2.8 (0.4), respectively, on scales ranging from 1 to 3. Overall, the mean (SD) accuracy (4.8 [1.1] vs. 5.6 [0.7]; P < 0.001) and completeness scores (2.1 [0.7] vs. 2.7 [0.4]; P < 0.001) were significantly lower among the experts than among the nonexperts, while comprehensibility was comparable between the two groups (2.8 [0.4] vs. 2.8 [0.3]; P = 0.55). Patients rated all answers as complete, comprehensible, and trustworthy in 97%-100% of cases.
ChatGPT showed good performance, with the potential to enhance awareness of CRC and improve screening outcomes. Generative language systems may be further improved through appropriate training in accordance with scientific evidence and current guidelines.