Silverman Anna L, Bhasuran Balu, Mosenia Arman, Yasini Fatema, Ramasamy Gokul, Banerjee Imon, Gupta Saransh, Mardirossian Taline, Narain Rohan, Sewell Justin, Butte Atul J, Rudrapatna Vivek A
Division of Gastroenterology and Hepatology, Department of Medicine, Mayo Clinic, Phoenix, AZ, USA.
Department of Medicine, University of California, San Diego, La Jolla, CA, USA.
Inflamm Bowel Dis. 2025 Mar 3;31(3):665-670. doi: 10.1093/ibd/izae068.
The Mayo endoscopic subscore (MES) is an important quantitative measure of disease activity in ulcerative colitis. Colonoscopy reports in routine clinical care usually describe ulcerative colitis disease activity in free text, which limits their utility for clinical research and quality improvement. We sought to develop algorithms to classify colonoscopy reports according to their MES.
We annotated 500 colonoscopy reports from 2 health systems. We trained and evaluated 4 classes of algorithms. Our primary outcome was accuracy in identifying scorable reports (binary) and assigning an MES (ordinal). Secondary outcomes included learning efficiency, generalizability, and fairness.
Automated machine learning models achieved 98% and 97% accuracy on the binary and ordinal prediction tasks, outperforming other models. Binary models trained on the University of California, San Francisco data alone maintained accuracy (96%) on validation data from Zuckerberg San Francisco General. When using 80% of the training data, models remained accurate for the binary task (97% [n = 320]) but lost accuracy on the ordinal task (67% [n = 194]). We found no evidence of bias by gender (P = .65) or area deprivation index (P = .80).
We derived a highly accurate pair of models capable of classifying reports by their MES and recognizing when to abstain from prediction. The models generalized to validation data from an outside institution, and we found no evidence of algorithmic bias. Our methods have the potential to enable retrospective studies of treatment effectiveness, prospective identification of patients meeting study criteria, and quality improvement efforts in inflammatory bowel diseases.
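As an illustration only, the pairing of a binary "scorable" model with an ordinal MES model can be sketched as a two-stage cascade in which the first stage decides whether to abstain. The abstract does not describe the implementation, so everything below is an assumption: the function names (`classify_report`, `accuracy`) are hypothetical, and the keyword rules are toy placeholders standing in for trained models, not the authors' method.

```python
from typing import Callable, Optional, Sequence

VALID_MES = (0, 1, 2, 3)  # the Mayo endoscopic subscore ranges from 0 to 3


def classify_report(
    text: str,
    is_scorable: Callable[[str], bool],
    predict_mes: Callable[[str], int],
) -> Optional[int]:
    """Return an MES for a report, or None when the binary stage abstains."""
    if not is_scorable(text):
        return None  # abstain: the report does not support assigning an MES
    score = predict_mes(text)
    if score not in VALID_MES:
        raise ValueError(f"MES must be one of {VALID_MES}, got {score!r}")
    return score


def accuracy(predicted: Sequence[Optional[int]], expected: Sequence[Optional[int]]) -> float:
    """Fraction of reports whose predicted label matches the annotation."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Toy placeholders (hypothetical): crude keyword rules, NOT trained models.
def toy_is_scorable(text: str) -> bool:
    t = text.lower()
    return any(k in t for k in ("normal mucosa", "erythema", "friab", "ulcer"))


def toy_predict_mes(text: str) -> int:
    t = text.lower()
    if "ulcer" in t:
        return 3
    if "friab" in t:
        return 2
    if "erythema" in t:
        return 1
    return 0


if __name__ == "__main__":
    reports = [
        "Deep ulcers in the sigmoid colon.",
        "Poor preparation, examination aborted.",
        "Normal mucosa throughout.",
    ]
    for r in reports:
        print(r, "->", classify_report(r, toy_is_scorable, toy_predict_mes))
```

Keeping the abstention decision in a separate first stage means the ordinal model is applied only to reports that were judged scorable, which mirrors the binary and ordinal tasks named in the abstract.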