Siu Adrian H Y, Gibson Damien P, Chiu Chris, Kwok Allan, Irwin Matt, Christie Adam, Koh Cherry E, Keshava Anil, Reece Mifanwy, Suen Michael, Rickard Matthew J F X
Concord Institute of Academic Surgery (CIAS), Concord Repatriation General Hospital, Concord, New South Wales, Australia.
Faculty of Medicine and Health, Central Clinical School, The University of Sydney, Sydney, New South Wales, Australia.
Colorectal Dis. 2025 Jan;27(1):e17267. doi: 10.1111/codi.17267. Epub 2024 Dec 17.
Artificial intelligence (AI) chatbots such as Chat Generative Pretrained Transformer-4 (ChatGPT-4) have made significant strides in generating human-like responses. Trained on an extensive corpus that includes medical literature, ChatGPT-4 has the potential to augment patient education materials, and such chatbots may benefit people facing a diagnosis of colorectal cancer (CRC). However, the accuracy and quality of patient education materials are crucial for informed decision-making. Given workforce pressures that limit the time available for holistic care, AI chatbots could help bridge gaps in CRC information, reaching wider demographics and crossing language barriers, but rigorous evaluation is essential to ensure accuracy, quality and readability. This study therefore aims to evaluate the efficacy, quality and readability of answers generated by ChatGPT-4 on CRC, using patient-style question prompts.
To evaluate ChatGPT-4, eight CRC-related questions were derived from peer-reviewed literature and Google Trends. Eight colorectal surgeons rated the AI responses for accuracy, safety, appropriateness, actionability and effectiveness. Quality was assessed using validated tools: the Patient Education Materials Assessment Tool (PEMAT-AI), a modified DISCERN instrument (DISCERN-AI) and the Global Quality Score (GQS). Readability was measured with several indices, including the Flesch Reading Ease (FRE) score and the Gunning Fog Index (GFI).
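For context, the two readability indices reported below follow their standard published formulas (the abstract does not state which implementation the authors used); a reference sketch, where "complex words" are those of three or more syllables:

\[ \mathrm{FRE} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}} \]
\[ \mathrm{GFI} = 0.4\left(\frac{\text{total words}}{\text{total sentences}} + 100\,\frac{\text{complex words}}{\text{total words}}\right) \]

Higher FRE indicates easier text (scores below about 60 are generally considered difficult), while GFI approximates the years of formal education needed to understand the text on a first reading.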
The responses were generally accurate (median 4.00), safe (4.25), appropriate (4.00), actionable (4.00) and effective (4.00). Quality assessments rated PEMAT-AI as 'very good' (71.43), DISCERN-AI as 'fair' (12.00) and GQS as 'high' (4.00). Readability scores indicated that the responses were difficult to read (FRE 47.00, GFI 12.40), implying that a relatively high level of education is required to comprehend them.
This study concludes that ChatGPT-4 can provide safe but nonspecific medical information, suggesting its potential as a patient education aid. However, improvements in readability, through contextual prompting and fine-tuning techniques, are required before implementation in clinical practice can be considered.