Impact of Gold-Standard Label Errors on Evaluating Performance of Deep Learning Models in Diabetic Retinopathy Screening: Nationwide Real-World Validation Study.

Affiliations

State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, China.

School of Optometry, The Hong Kong Polytechnic University, Kowloon, China (Hong Kong).

Publication Information

J Med Internet Res. 2024 Aug 14;26:e52506. doi: 10.2196/52506.

Abstract

BACKGROUND

For medical artificial intelligence (AI) training and validation, human expert labels are considered the gold standard that represents the correct answers or desired outputs for a given data set. These labels serve as a reference or benchmark against which the model's predictions are compared.

OBJECTIVE

This study aimed to assess the accuracy of a custom deep learning (DL) algorithm in classifying diabetic retinopathy (DR) and to further demonstrate how label errors may affect this assessment in a nationwide DR-screening program.

METHODS

Fundus photographs from the Lifeline Express, a nationwide DR-screening program, were analyzed to identify the presence of referable DR using both (1) manual grading by National Health Service England-certified graders and (2) a DL-based DR-screening algorithm with good, previously validated laboratory performance. To assess the accuracy of the labels, a random sample of images on which the DL algorithm and the labels disagreed was adjudicated by ophthalmologists who were masked to the previous grading results. The label error rates in this sample were then used to correct the numbers of negative and positive cases in the entire data set, yielding postcorrection labels. The DL algorithm's performance was evaluated against both the pre- and postcorrection labels.
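To make the correction step concrete, the following is a minimal sketch of how adjudicated error rates in the disagreement sample can be used to flip the affected labels and recompute sensitivity and specificity. This is an illustration under stated assumptions, not the authors' code: the function names, the confusion-matrix counts, and the error-rate values in the example run are hypothetical.

```python
# Minimal sketch (not the authors' pipeline): confusion-matrix counts and
# error rates below are hypothetical placeholders.

def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def correct_labels(tp, fn, tn, fp, neg_error_rate, pos_error_rate):
    """Flip the estimated share of wrong gold-standard labels.

    neg_error_rate: adjudicated fraction of human-negative / DL-positive
                    disagreements where the human label was wrong.
    pos_error_rate: adjudicated fraction of human-positive / DL-negative
                    disagreements where the human label was wrong.
    """
    flipped_fp = fp * neg_error_rate  # false positives that are really true positives
    flipped_fn = fn * pos_error_rate  # false negatives that are really true negatives
    return (tp + flipped_fp,          # corrected TP
            fn - flipped_fn,          # corrected FN
            tn + flipped_fn,          # corrected TN
            fp - flipped_fp)          # corrected FP

if __name__ == "__main__":
    # Hypothetical precorrection confusion matrix of the DL output against
    # the human (gold-standard) labels.
    tp, fn, tn, fp = 20_000, 5_000, 700_000, 11_000
    pre = sensitivity_specificity(tp, fn, tn, fp)
    post = sensitivity_specificity(
        *correct_labels(tp, fn, tn, fp,
                        neg_error_rate=0.60,  # hypothetical adjudicated rates
                        pos_error_rate=0.05))
    print(f"precorrection  sensitivity={pre[0]:.3f}  specificity={pre[1]:.3f}")
    print(f"postcorrection sensitivity={post[0]:.3f}  specificity={post[1]:.3f}")
```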

RESULTS

The analysis included 736,083 images from 237,824 participants. The DL algorithm exhibited a gap between the real-world performance and the lab-reported performance in this nationwide data set, with a sensitivity increase of 12.5% (from 79.6% to 92.5%, P<.001) and a specificity increase of 6.9% (from 91.6% to 98.5%, P<.001). In the random sample, 63.6% (560/880) of negative images and 5.2% (140/2710) of positive images were misclassified in the precorrection human labels. High myopia was the primary reason for misclassifying non-DR images as referable DR images, while laser spots were predominantly responsible for misclassified referable cases. The estimated label error rate for the entire data set was 1.2%. The label correction was estimated to bring about a 12.5% enhancement in the estimated sensitivity of the DL algorithm (P<.001).
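As a quick arithmetic check, the two sampled error rates quoted above follow directly from the adjudicated counts reported in this paragraph; the snippet below simply reproduces that division. The 1.2% whole-dataset estimate additionally depends on the total numbers of disagreement images, which the abstract does not report, so it is not recomputed here.

```python
# Error rates in the adjudicated disagreement sample, recomputed from the
# counts reported above.
neg_misclassified, neg_sampled = 560, 880    # human-negative disagreement images
pos_misclassified, pos_sampled = 140, 2710   # human-positive disagreement images

print(f"negative-label error rate: {neg_misclassified / neg_sampled:.1%}")  # 63.6%
print(f"positive-label error rate: {pos_misclassified / pos_sampled:.1%}")  # 5.2%
```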

CONCLUSIONS

Label errors arising from human image grading, although accounting for only a small percentage of labels, can significantly affect the performance evaluation of DL algorithms in real-world DR screening.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5381/11358665/04d7215ebe85/jmir_v26i1e52506_fig1.jpg
