Google Research, Google Inc., Mountain View, California.
Department of Ophthalmology, Palo Alto Medical Foundation, Palo Alto, California.
Ophthalmology. 2018 Aug;125(8):1264-1272. doi: 10.1016/j.ophtha.2018.01.034. Epub 2018 Mar 13.
To use adjudication to quantify errors in diabetic retinopathy (DR) grading by individual graders and by majority decision, and to train an improved automated algorithm for DR grading.
Retrospective analysis.
Retinal fundus images from DR screening programs.
Images were each graded by the algorithm, U.S. board-certified ophthalmologists, and retinal specialists. The adjudicated consensus of the retinal specialists served as the reference standard.
For agreement between different graders as well as between the graders and the algorithm, we measured the (quadratic-weighted) kappa score. To compare the performance of different forms of manual grading and the algorithm for various DR severity cutoffs (e.g., mild or worse DR, moderate or worse DR), we measured area under the curve (AUC), sensitivity, and specificity.
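The metrics above can be made concrete with a small worked sketch. The following is illustrative only (it is not the study's code, and all grades and scores in it are made up): it computes quadratic-weighted kappa on a 5-point DR scale (0 = none, 1 = mild, 2 = moderate, 3 = severe, 4 = proliferative), then binarizes grades at a severity cutoff (here, moderate or worse) to obtain sensitivity and specificity, and computes AUC from a hypothetical continuous algorithm score.

```python
def quadratic_weighted_kappa(a, b, k=5):
    """Cohen's kappa with quadratic weights: disagreements are penalized
    by the squared distance between grades, so a 1-step error costs far
    less than a 4-step error."""
    n = len(a)
    obs = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[x][y] += 1
    ra = [a.count(i) for i in range(k)]   # marginal counts, rater A
    rb = [b.count(i) for i in range(k)]   # marginal counts, rater B
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * obs[i][j]              # observed weighted disagreement
            den += w * ra[i] * rb[j] / n      # chance-expected disagreement
    return 1 - num / den

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count half) -- equivalent to the area under the ROC curve."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical grades: adjudicated consensus vs. one grader.
reference = [0, 0, 1, 2, 2, 3, 4, 0, 1, 2]
grader    = [0, 1, 1, 2, 1, 3, 4, 0, 2, 2]

kappa = quadratic_weighted_kappa(reference, grader)

# Binarize at "moderate or worse DR" (grade >= 2) for referable disease.
cutoff = 2
ref_bin  = [int(g >= cutoff) for g in reference]
pred_bin = [int(g >= cutoff) for g in grader]
tp = sum(r and p for r, p in zip(ref_bin, pred_bin))
fn = sum(r and not p for r, p in zip(ref_bin, pred_bin))
tn = sum(not r and not p for r, p in zip(ref_bin, pred_bin))
fp = sum(not r and p for r, p in zip(ref_bin, pred_bin))
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

# Hypothetical continuous severity scores from an algorithm, evaluated
# against the binarized reference standard.
scores = [0.05, 0.30, 0.20, 0.80, 0.40, 0.95, 0.99, 0.10, 0.55, 0.70]
algorithm_auc = auc(ref_bin, scores)
```

Note that kappa is computed on the full 5-point scale, whereas sensitivity, specificity, and AUC each require choosing a severity cutoff first, which is why the abstract reports them separately for "mild or worse" and "moderate or worse" DR.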
Of the 193 discrepancies between adjudication by retinal specialists and the majority decision of ophthalmologists, the most common were missed microaneurysms (MAs) (36%), artifacts (20%), and misclassified hemorrhages (16%). Relative to the reference standard, kappa scores for individual retinal specialists ranged from 0.82 to 0.91, those for individual ophthalmologists ranged from 0.80 to 0.84, and the algorithm scored 0.84. For moderate or worse DR, the majority decision of ophthalmologists had a sensitivity of 0.838 and specificity of 0.981. The algorithm had a sensitivity of 0.971, specificity of 0.923, and AUC of 0.986. For mild or worse DR, the algorithm had a sensitivity of 0.970, specificity of 0.917, and AUC of 0.986. By using a small number of adjudicated consensus grades as a tuning dataset and higher-resolution images as input, the algorithm's AUC for moderate or worse DR improved from 0.934 to 0.986.
Adjudication reduces the errors in DR grading. A small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists.