Lei Ian Io, Gaya Daniel R, Robertson Alexander, Schelde-Olesen Benedicte, Mapiye Alice, Bhandare Anirudh, Lui Bei Bei, Shekhar Chander, Valentiner Ursula, Gilabert Pere, Laiz Pablo, Segui Santi, Parsons Nicholas, Huhulea Cristiana, Wenzek Hagen, White Elizabeth, Koulaouzidis Anastasios, Arasaradnam Ramesh P
Institute of Precision Diagnostics & Translational Medicine, University Hospital of Coventry and Warwickshire, Clifford Bridge Rd, Coventry CV2 2DX, UK.
School of Medicine, University of Warwick, Coventry CV4 7AL, UK.
Cancers (Basel). 2025 Aug 29;17(17):2840. doi: 10.3390/cancers17172840.
Colon capsule endoscopy (CCE) has seen increased adoption since the COVID-19 pandemic, offering a non-invasive alternative for lower gastrointestinal investigations. However, inadequate bowel preparation remains a key limitation, often leading to higher conversion rates to colonoscopy. Manual assessment of bowel cleanliness is inherently subjective and marked by high interobserver variability. Recent advances in artificial intelligence (AI) have enabled automated cleansing scores that not only standardise assessment and reduce variability but also align with the emerging semi-automated AI reading workflow, which highlights only clinically significant frames. As full video review becomes less routine, reliable and consistent cleansing evaluation is essential, positioning bowel preparation AI as a critical enabler of diagnostic accuracy and scalable CCE deployment. This CESCAIL sub-study aimed to (1) evaluate interobserver agreement in CCE bowel cleansing assessment using two established scoring systems, and (2) determine the impact of AI-assisted scoring, specifically a TransUNet-based segmentation model with a custom Patch Loss function, on both interobserver and intraobserver agreement compared with manual assessment. As part of the CESCAIL study, twenty-five CCE videos were randomly selected from 673 participants. Nine readers with varying CCE experience scored bowel cleanliness using the Leighton-Rex and CC-CLEAR scales. After a minimum 8-week washout, the same readers reassessed the videos using AI-assisted CC-CLEAR scores. Interobserver variability was evaluated using bootstrapped intraclass correlation coefficients (ICC) and Fleiss' Kappa; intraobserver variability was assessed with weighted Cohen's Kappa, paired t-tests, and Two One-Sided Tests (TOSTs). Leighton-Rex showed poor to fair agreement (Fleiss' κ = 0.14; ICC = 0.55), while CC-CLEAR demonstrated fair to excellent agreement (Fleiss' κ = 0.27; ICC = 0.90). AI-assisted CC-CLEAR achieved only moderate agreement overall (Fleiss' κ = 0.27; ICC = 0.69), with weaker performance among less experienced readers (Fleiss' κ = 0.15; ICC = 0.56). Intraobserver agreement was excellent (ICC > 0.75) for experienced readers but variable in others (ICC 0.03-0.80). AI-assisted scores were on average 1.46 points lower than manual reads (p < 0.001), potentially increasing conversion to colonoscopy. AI-assisted scoring did not improve interobserver agreement and may even reduce consistency amongst less experienced readers. The agreement maintained by experienced readers highlights its current value in experienced hands only. Further refinement, including spatial analysis integration, is needed for robust AI implementation in CCE.
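The interobserver statistics named above can be illustrated from a videos × readers score matrix. The Python sketch below is a minimal illustration under stated assumptions, not the study's analysis code: it assumes a 25 × 9 matrix of integer CC-CLEAR-style scores (0-10), implements Fleiss' Kappa and a two-way random, single-measure ICC(2,1) (the abstract does not specify which ICC form was used, so that choice is an assumption), and bootstraps the ICC by resampling videos with replacement.

```python
# Illustrative sketch only: interobserver agreement on a (videos x readers)
# score matrix. Data here are synthetic; ICC(2,1) is an assumed variant.
import numpy as np

rng = np.random.default_rng(0)

def fleiss_kappa(ratings: np.ndarray, n_categories: int) -> float:
    """Fleiss' Kappa for a (subjects x raters) matrix of integer categories."""
    n_subjects, n_raters = ratings.shape
    # Count how many raters assigned each category to each subject.
    counts = np.zeros((n_subjects, n_categories))
    for j in range(n_categories):
        counts[:, j] = (ratings == j).sum(axis=1)
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)   # category prevalence
    P_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

def icc_2_1(scores: np.ndarray) -> float:
    """Two-way random, single-measure ICC(2,1) for a (subjects x raters) matrix."""
    n, k = scores.shape
    grand = scores.mean()
    ms_r = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # subjects
    ms_c = n * ((scores.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    sse = ((scores - scores.mean(axis=1, keepdims=True)
                   - scores.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_e = sse / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

def bootstrap_icc(scores: np.ndarray, n_boot: int = 2000) -> tuple[float, float]:
    """Percentile 95% CI for ICC(2,1), resampling videos (rows) with replacement."""
    boot = [icc_2_1(scores[rng.integers(0, len(scores), len(scores))])
            for _ in range(n_boot)]
    return tuple(np.percentile(boot, [2.5, 97.5]))

# Hypothetical example: 25 videos, 9 readers, scores on a 0-10 scale.
scores = rng.integers(0, 11, size=(25, 9))
print("Fleiss' Kappa:", round(fleiss_kappa(scores, n_categories=11), 3))
print("ICC(2,1):", round(icc_2_1(scores), 3), "95% CI:", bootstrap_icc(scores))
```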
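The intraobserver comparisons can be sketched the same way for a single reader's manual versus AI-assisted reads, using quadratically weighted Cohen's Kappa, a paired t-test, and a paired TOST. Everything below is illustrative: the synthetic data, the quadratic weighting, and the ±1-point equivalence margin are assumptions, not values reported in the paper; the simulated downward shift merely mimics the reported ~1.46-point difference.

```python
# Illustrative sketch only: intraobserver agreement for one reader's two reads.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def weighted_cohen_kappa(a: np.ndarray, b: np.ndarray, n_categories: int) -> float:
    """Quadratically weighted Cohen's Kappa for two reads of ordinal scores."""
    conf = np.zeros((n_categories, n_categories))
    np.add.at(conf, (a, b), 1)           # confusion matrix of read 1 vs. read 2
    conf /= conf.sum()
    idx = np.arange(n_categories)
    w = ((idx[:, None] - idx[None, :]) ** 2) / (n_categories - 1) ** 2
    expected = np.outer(conf.sum(axis=1), conf.sum(axis=0))
    return 1 - (w * conf).sum() / (w * expected).sum()

def paired_tost(x: np.ndarray, y: np.ndarray, margin: float) -> float:
    """Two One-Sided Tests: p-value for equivalence of paired means within +/-margin."""
    d = x - y
    se = d.std(ddof=1) / np.sqrt(len(d))
    df = len(d) - 1
    p_lower = stats.t.sf((d.mean() + margin) / se, df)   # H0: mean diff <= -margin
    p_upper = stats.t.cdf((d.mean() - margin) / se, df)  # H0: mean diff >= +margin
    return max(p_lower, p_upper)

# Hypothetical reader: manual read vs. AI-assisted re-read of 25 videos (0-10),
# with the AI-assisted scores shifted downward by ~1.5 points on average.
manual = rng.integers(0, 11, size=25)
ai_assisted = np.clip(manual - rng.poisson(1.5, size=25), 0, 10)

print("Weighted Cohen's Kappa:", round(weighted_cohen_kappa(manual, ai_assisted, 11), 3))
print("Paired t-test p:", stats.ttest_rel(manual, ai_assisted).pvalue)
print("TOST p (margin = 1):", paired_tost(manual, ai_assisted, margin=1.0))
```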