Pourjavan Sayeh, Bourguignon Gen-Hua, Marinescu Cristina, Otjacques Loic, Boschi Antonella
Department of Ophthalmology, Cliniques Universitaires Saint Luc, UCL, Brussels, Belgium.
Department of Ophthalmology, Chirec Hospital Groups, Delta Hospital, Brussels, Belgium.
Clin Ophthalmol. 2024 Dec 27;18:3999-4009. doi: 10.2147/OPTH.S492872. eCollection 2024.
This study aims to evaluate the inter-observer variability in assessing the optic disc in fundus photographs and its implications for establishing ground truth in AI research.
Seventy subjects were screened during a screening campaign. Fundus photographs were classified as normal (NL) or abnormal (GS: glaucoma and glaucoma suspects) by two masked glaucoma specialists. Referrals were based on these classifications, followed by intraocular pressure (IOP) measurements, with decisions made rapidly to simulate a busy outpatient clinic. In the second stage, four glaucoma specialists independently categorized images as normal, suspect, or glaucomatous. Reassessments were then conducted with access to IOP and contralateral eye data.
In the first stage, agreement between the senior and junior specialists in categorizing patients as normal or abnormal was moderately high. Knowledge of IOP emerged as an independent factor that led the specialists to refer more patients. In the second stage, agreement among the four specialists varied, with greater concordance observed when additional clinical information was available. Notably, there was statistically significant variability in the assessment of optic disc excavation.
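Agreement between two graders on a binary normal/abnormal classification is commonly summarized with a chance-corrected statistic such as Cohen's kappa. The abstract does not state which statistic the authors used, so the following is only a minimal sketch of that general technique. The function name `cohen_kappa` and the `senior` and `junior` label lists are hypothetical illustrations, not data from the study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must grade the same non-empty set of items")
    n = len(ratings_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement by chance, from each rater's marginal label frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical gradings of ten fundus photographs (NL = normal, GS = abnormal).
senior = ["NL", "NL", "GS", "GS", "NL", "GS", "NL", "NL", "GS", "NL"]
junior = ["NL", "GS", "GS", "GS", "NL", "NL", "NL", "NL", "GS", "NL"]

kappa = cohen_kappa(senior, junior)
print(f"Cohen's kappa: {kappa:.3f}")
```

For more than two graders, as in the second stage, a multi-rater statistic such as Fleiss' kappa would be the usual counterpart; again, that is an assumption about common practice rather than a description of the study's analysis.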
Access to risk factors such as IOP and bilateral data significantly influences both the classification accuracy of specialists and the diagnostic consistency among them. Reliance solely on fundus photographs for AI training can be misleading because of inter-observer variability. Comprehensive datasets integrating multimodal clinical information are essential for developing robust AI models for glaucoma screening.