Impact of Gold-Standard Label Errors on Evaluating Performance of Deep Learning Models in Diabetic Retinopathy Screening: Nationwide Real-World Validation Study.

Affiliations

State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, China.

School of Optometry, The Hong Kong Polytechnic University, Kowloon, China (Hong Kong).

Publication Information

J Med Internet Res. 2024 Aug 14;26:e52506. doi: 10.2196/52506.

Abstract

BACKGROUND

For medical artificial intelligence (AI) training and validation, human expert labels are considered the gold standard that represents the correct answers or desired outputs for a given data set. These labels serve as a reference or benchmark against which the model's predictions are compared.

OBJECTIVE

This study aimed to assess the accuracy of a custom deep learning (DL) algorithm in classifying diabetic retinopathy (DR) and to further demonstrate how label errors may affect this assessment in a nationwide DR-screening program.

METHODS

Fundus photographs from the Lifeline Express, a nationwide DR-screening program, were analyzed to identify the presence of referable DR using both (1) manual grading by National Health Service England-certified graders and (2) a DL-based DR-screening algorithm with good, previously validated laboratory performance. To assess the accuracy of the labels, a random sample of images on which the DL algorithm and the labels disagreed was adjudicated by ophthalmologists who were masked to the previous grading results. The label error rates in this sample were then used to correct the numbers of negative and positive cases in the entire data set, yielding postcorrection labels. The DL algorithm's performance was evaluated against both the pre- and postcorrection labels.
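To make the correction step concrete, the following is a minimal sketch of how adjudicated error rates in the disagreement sample can be used to flip the affected labels and recompute sensitivity and specificity. This is an illustration under stated assumptions, not the authors' code: the function names, the confusion-matrix counts, and the error-rate values in the example run are hypothetical.

```python
# Minimal sketch (not the authors' pipeline): confusion-matrix counts and
# error rates below are hypothetical placeholders.

def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def correct_labels(tp, fn, tn, fp, neg_error_rate, pos_error_rate):
    """Flip the estimated share of wrong gold-standard labels.

    neg_error_rate: adjudicated fraction of human-negative / DL-positive
                    disagreements where the human label was wrong.
    pos_error_rate: adjudicated fraction of human-positive / DL-negative
                    disagreements where the human label was wrong.
    """
    flipped_fp = fp * neg_error_rate  # false positives that are really true positives
    flipped_fn = fn * pos_error_rate  # false negatives that are really true negatives
    return (tp + flipped_fp,          # corrected TP
            fn - flipped_fn,          # corrected FN
            tn + flipped_fn,          # corrected TN
            fp - flipped_fp)          # corrected FP

if __name__ == "__main__":
    # Hypothetical precorrection confusion matrix of the DL output against
    # the human (gold-standard) labels.
    tp, fn, tn, fp = 20_000, 5_000, 700_000, 11_000
    pre = sensitivity_specificity(tp, fn, tn, fp)
    post = sensitivity_specificity(
        *correct_labels(tp, fn, tn, fp,
                        neg_error_rate=0.60,  # hypothetical adjudicated rates
                        pos_error_rate=0.05))
    print(f"precorrection  sensitivity={pre[0]:.3f}  specificity={pre[1]:.3f}")
    print(f"postcorrection sensitivity={post[0]:.3f}  specificity={post[1]:.3f}")
```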

RESULTS

The analysis included 736,083 images from 237,824 participants. The DL algorithm exhibited a gap between the real-world performance and the lab-reported performance in this nationwide data set, with a sensitivity increase of 12.5% (from 79.6% to 92.5%, P<.001) and a specificity increase of 6.9% (from 91.6% to 98.5%, P<.001). In the random sample, 63.6% (560/880) of negative images and 5.2% (140/2710) of positive images were misclassified in the precorrection human labels. High myopia was the primary reason for misclassifying non-DR images as referable DR images, while laser spots were predominantly responsible for misclassified referable cases. The estimated label error rate for the entire data set was 1.2%. The label correction was estimated to bring about a 12.5% enhancement in the estimated sensitivity of the DL algorithm (P<.001).
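As a quick arithmetic check, the two sampled error rates quoted above follow directly from the adjudicated counts reported in this paragraph; the snippet below simply reproduces that division. The 1.2% whole-dataset estimate additionally depends on the total numbers of disagreement images, which the abstract does not report, so it is not recomputed here.

```python
# Error rates in the adjudicated disagreement sample, recomputed from the
# counts reported above.
neg_misclassified, neg_sampled = 560, 880    # human-negative disagreement images
pos_misclassified, pos_sampled = 140, 2710   # human-positive disagreement images

print(f"negative-label error rate: {neg_misclassified / neg_sampled:.1%}")  # 63.6%
print(f"positive-label error rate: {pos_misclassified / pos_sampled:.1%}")  # 5.2%
```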

CONCLUSIONS

Label errors arising from human image grading, although accounting for only a small percentage of labels, can significantly affect the performance evaluation of DL algorithms in real-world DR screening.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5381/11358665/04d7215ebe85/jmir_v26i1e52506_fig1.jpg
