Olsen Griffin H, Goodman Emmett D, Aklilu Josiah G, Bartoletti Sebastiano, Hung Kay S, Yang Janice H, Sorenson Eric C, Jopling Jeffrey K, Yeung Serena Y, Azagury Dan E
Intermountain Healthcare Delivery Institute, Intermountain Health, Salt Lake City, UT, USA.
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
Surg Endosc. 2025 Aug 18. doi: 10.1007/s00464-025-12015-6.
Determining cholecystitis severity via the clinically validated Parkland Grading Scale (PGS) is useful for predicting case difficulty and likelihood of postoperative complications. A panel assessment by multiple surgeons can reduce variation in PGS due to subjectivity, but is time-consuming. An artificial intelligence (AI) model trained on the assessments of an expert clinician panel may improve efficiency and reduce variability in diagnosis in image-based assessments.
Laparoscopic cholecystectomy videos were obtained from one public and two private data sources. Representative frames were chosen for PGS grading and manually labeled. Three surgical experts independently assigned PGS scores to the selected frames. When the individual scores were discrepant, the experts convened as a panel to decide on a consensus score. Inter-rater variability was measured with the weighted Cohen's kappa statistic. Two AI models were developed for automated PGS grading, and their accuracy and interpretability were evaluated.
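As an illustration of the inter-rater statistic named above, the following is a minimal sketch of weighted Cohen's kappa for two raters on an ordinal scale. It is not the authors' code. The abstract does not state the weighting scheme, so the `weighting` parameter (linear or quadratic) is an assumption, as are the function name and the default grade set of 1 to 5.

```python
def weighted_kappa(rater_a, rater_b, grades=(1, 2, 3, 4, 5), weighting="linear"):
    """Weighted Cohen's kappa for two raters on an ordinal scale.

    Assumptions (not stated in the abstract): grades are the ordered
    values in `grades`, and disagreement weights are linear or quadratic
    in the distance between grade positions.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must score the same non-empty set of frames")
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")

    k = len(grades)
    index = {grade: i for i, grade in enumerate(grades)}
    n = len(rater_a)

    # Observed joint proportions: observed[i][j] is the share of frames
    # graded grades[i] by rater A and grades[j] by rater B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 1 if weighting == "linear" else 2
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]

    if expected_disagreement == 0.0:
        # Both raters used one identical grade for every frame.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement
```

With three experts, the statistic would be computed once per pair of raters, which is one plausible reading of why the results report a range of kappa values rather than a single number.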
A total of 319 videos were compiled. Three surgical experts independently assigned identical PGS grades for 51% of cases, and weighted Cohen's kappa statistics ranged between 0.76 and 0.83. The accuracy of Model A, measured as absolute agreement with the expert panel's consensus, was 69%, and its weighted Cohen's kappa statistic was 0.62. The accuracy of Model B, measured as absolute agreement with the panel's consensus, was 72%, and its weighted Cohen's kappa statistic was 0.77. An interpretability analysis was conducted. Three anatomical structures played a key role in Model B's grading of cholecystitis severity: the appearance of the gallbladder, liver, and omentum had a notable impact on performance.
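The two agreement figures reported above (the share of cases with identical grades across experts, and model accuracy as absolute agreement with the consensus) can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code; the function names and the toy grades are hypothetical and are not data from the study.

```python
def unanimous_fraction(*raters):
    """Share of cases on which every rater assigned the same grade."""
    n = len(raters[0])
    if n == 0 or any(len(r) != n for r in raters):
        raise ValueError("all raters must grade the same non-empty set of cases")
    return sum(len(set(grades)) == 1 for grades in zip(*raters)) / n


def exact_agreement(predicted, consensus):
    """Accuracy as absolute agreement: share of cases where the predicted
    grade equals the panel consensus grade exactly."""
    if len(predicted) != len(consensus) or len(predicted) == 0:
        raise ValueError("inputs must be the same non-empty length")
    return sum(p == c for p, c in zip(predicted, consensus)) / len(predicted)


# Toy example with made-up grades (hypothetical, for illustration only).
expert_1 = [1, 2, 3, 4]
expert_2 = [1, 2, 4, 4]
expert_3 = [1, 3, 3, 4]
consensus = [1, 2, 3, 4]
model = [1, 2, 3, 5]

unanimous = unanimous_fraction(expert_1, expert_2, expert_3)
accuracy = exact_agreement(model, consensus)
```

Absolute agreement treats a one-grade miss the same as a four-grade miss, which is why the abstract reports it alongside a weighted kappa that gives partial credit for near misses.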
A transformer-based AI model can be trained on consensus from an expert panel to predict ratings of cholecystitis severity (Parkland Grading Scale), performing competitively with some individual experts at predicting PGS when compared to the panel-based ground truth. However, variance and subjectivity in PGS remain, which limits its use as a ground truth for computer vision-based models.