Department of Gastroenterology and Hepatology, Singapore General Hospital, Singapore.
Medicine Academic Clinical Programme, Duke-NUS Medical School, Singapore, Singapore.
J Gastroenterol Hepatol. 2024 Jan;39(1):81-106. doi: 10.1111/jgh.16375. Epub 2023 Oct 19.
Colonoscopy is commonly used in screening and surveillance for colorectal cancer. Multiple guidelines provide recommendations on the interval between colonoscopies, and these can be challenging for non-specialist healthcare providers to navigate. Large language models like ChatGPT are a potential tool for parsing patient histories and providing advice. However, the standard GPT model is not designed for medical use and can hallucinate. One way to overcome these challenges is to supply the model with contextual information from medical guidelines so that it responds accurately to queries. Our study compares the standard GPT4 against a contextualized model provided with relevant screening guidelines. We evaluated whether the models could provide correct advice on screening and surveillance intervals for colonoscopy.
Relevant guidelines pertaining to colorectal cancer screening and surveillance were formulated into a knowledge base for GPT. We tested 62 example case scenarios (three times each) on standard GPT4 and on a contextualized model with the knowledge base.
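The comparison described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the study's implementation: the `Case` fields, the prompt wording, the `ask_model` callable, and the substring-based scoring are all hypothetical stand-ins. The abstract does not state how answers were graded, so the substring check is only a crude placeholder for that step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One test scenario (hypothetical structure, not from the study)."""
    case_id: str
    history: str
    expected_interval: str  # reference answer, e.g. as assigned by reviewers


def build_prompt(history: str, guidelines: str = "") -> str:
    """Build a query; guideline text is prepended only for the contextualized run."""
    parts = []
    if guidelines:
        parts.append("Guidelines:\n" + guidelines)
    parts.append("Patient history:\n" + history)
    parts.append("Question: When should the next colonoscopy be performed?")
    return "\n\n".join(parts)


def evaluate(
    cases: List[Case],
    ask_model: Callable[[str], str],  # assumed wrapper around whichever model is tested
    guidelines: str = "",
    repeats: int = 3,  # each scenario is queried three times, as in the text
) -> float:
    """Return the fraction of runs whose answer contains the expected interval."""
    correct = 0
    total = 0
    for case in cases:
        prompt = build_prompt(case.history, guidelines)
        for _ in range(repeats):
            answer = ask_model(prompt)
            total += 1
            # Crude stand-in for grading: case-insensitive substring match.
            if case.expected_interval.lower() in answer.lower():
                correct += 1
    return correct / total if total else 0.0
```

Calling `evaluate` once with an empty `guidelines` string and once with the knowledge-base text, using the same cases and the same `ask_model` wrapper, gives two accuracy figures that can be compared directly.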
The contextualized GPT4 model outperformed the standard GPT4 in all domains. No high-risk features were missed, and additional high-risk features were hallucinated in only two cases. A correct interval to colonoscopy was provided in the majority of cases. Guidelines were appropriately cited in almost all cases.
A contextualized GPT4 model could identify high-risk features and quote appropriate guidelines without significant hallucination. It gave a correct interval to the next colonoscopy in the majority of cases. This provides proof of concept that ChatGPT with appropriate refinement can serve as an accurate physician assistant.