
AI Workflow, External Validation, and Development in Eye Disease Diagnosis.

Author Information

Chen Qingyu, Keenan Tiarnan D L, Agron Elvira, Allot Alexis, Guan Emily, Duong Bryant, Elsawy Amr, Hou Benjamin, Xue Cancan, Bhandari Sanjeeb, Broadhead Geoffrey, Cousineau-Krieger Chantal, Davis Ellen, Gensheimer William G, Golshani Cyrus A, Grasic David, Gupta Seema, Haddock Luis, Konstantinou Eleni, Lamba Tania, Maiberger Michele, Mantopoulos Dimosthenis, Mehta Mitul C, Elnahry Ayman G, Al-Nawaflh Mutaz, Oshinsky Arnold, Powell Brittany E, Purt Boonkit, Shin Soo, Stiefel Hillary, Thavikulwat Alisa T, Wroblewski Keith James, Tham Yih Chung, Cheung Chui Ming Gemmy, Cheng Ching-Yu, Chew Emily Y, Hribar Michelle R, Chiang Michael F, Lu Zhiyong

Affiliations

National Library of Medicine, National Institutes of Health, Bethesda, Maryland.

Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, Connecticut.

Publication Information

JAMA Netw Open. 2025 Jul 1;8(7):e2517204. doi: 10.1001/jamanetworkopen.2025.17204.

Abstract

IMPORTANCE

Timely disease diagnosis is challenging due to limited clinical availability and growing burdens. Although artificial intelligence (AI) has shown expert-level diagnostic accuracy, a lack of downstream accountability, including workflow integration, external validation, and further development, continues to hinder its clinical adoption.

OBJECTIVE

To address gaps in the downstream accountability of medical AI through a case study on age-related macular degeneration (AMD) diagnosis and severity classification.

DESIGN, SETTING, AND PARTICIPANTS

This diagnostic study developed and evaluated an AI-assisted diagnostic and classification workflow for AMD. Four rounds of diagnostic assessments (accuracy and time) were conducted with 24 clinicians from 12 institutions. Each round was randomized and alternated between manual (clinician diagnosis) and manual plus AI (clinician assisted by AI diagnosis), with a 1-month washout period. In total, 2880 AMD risk features were evaluated across 960 images from 240 Age-Related Eye Disease Study patient samples, both with and without AI assistance. For further development, the original DeepSeeNet model was enhanced into the DeepSeeNet+ model using 39 196 additional images from the US population and tested on 3 datasets, including an external set from Singapore.

EXPOSURE

Age-related macular degeneration risk features.

MAIN OUTCOMES AND MEASURES

The F1 score for accuracy (Wilcoxon rank sum test) and diagnostic time (linear mixed-effects model) were measured, comparing manual vs manual plus AI. For further development, the F1 score (Wilcoxon rank sum test) was again used.
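The primary accuracy metric is the F1 score, the harmonic mean of precision and recall, computed per risk feature and then compared between conditions with the Wilcoxon rank sum test. As a minimal sketch of the metric itself (pure Python; the grading labels below are hypothetical, not the study's data), per-feature F1 can be computed like this:

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*P*R / (P+R), the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical grades for one AMD risk feature (1 = present, 0 = absent)
truth  = [1, 0, 1, 1, 0, 1, 0, 0]  # reading-center reference standard
manual = [1, 0, 0, 1, 0, 0, 0, 1]  # clinician alone
ai     = [1, 0, 1, 1, 0, 0, 0, 0]  # clinician assisted by AI

print(round(f1_score(truth, manual), 2))  # → 0.57
print(round(f1_score(truth, ai), 2))      # → 0.86
```

Note that F1 penalizes both missed features (false negatives) and over-calls (false positives), which is why it suits imbalanced grading tasks better than raw accuracy.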

RESULTS

Among 240 patients (mean [SD] age, 68.5 [5.0] years; 127 female [53%]), AI assistance significantly improved accuracy for 23 of 24 clinicians, increasing the mean F1 score from 37.71 (95% CI, 27.83-44.17) to 45.52 (95% CI, 39.01-51.61), with some improvements exceeding 50%. Manual diagnosis initially took an estimated 39.8 seconds (95% CI, 34.1-45.6 seconds) per patient, whereas manual plus AI saved 10.3 seconds (95% CI, -15.1 to -5.5 seconds) and remained faster by 6.9 seconds (95% CI, 0.2-13.7 seconds) to 8.6 seconds (95% CI, 1.8-15.3 seconds) in subsequent rounds. However, combining manual and AI did not always yield the highest accuracy or efficiency, underscoring challenges in explainability and trust. The DeepSeeNet+ model performed better across the 3 test sets, achieving a significantly higher F1 score than the original model on the external Singapore cohort (52.43 [95% CI, 44.38-61.00] vs 38.95 [95% CI, 30.50-47.45]).

CONCLUSIONS AND RELEVANCE

In this diagnostic study, AI assistance was associated with improved accuracy and time efficiency for AMD diagnosis. Further development is essential for enhancing AI generalizability across diverse populations. These findings highlight the need for downstream accountability during early-stage clinical evaluations of medical AI.


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a2f8/12268484/3063c975995d/jamanetwopen-e2517204-g001.jpg
