Rikhye Rajeev V, Loh Aaron, Hong Grace Eunhae, Singh Preeti, Smith Margaret Ann, Muralidharan Vijaytha, Wong Doris, Sayres Rory, Jain Ayush, Phung Michelle, Betancourt Nicolas, Fong Bradley, Sahasrabudhe Rachna, Nasim Khoban, Eschholz Alec, Mustafa Basil, Freyberg Jan, Spitz Terry, Matias Yossi, Corrado Greg S, Chou Katherine, Webster Dale R, Bui Peggy, Liu Yuan, Liu Yun, Ko Justin, Lin Steven
Google Research, Mountain View, CA, USA.
Stanford University School of Medicine, Stanford, CA, USA.
EBioMedicine. 2025 Jun;116:105766. doi: 10.1016/j.ebiom.2025.105766. Epub 2025 Jun 2.
Generalisation of artificial intelligence (AI) models to a new setting is challenging. In this study, we sought to understand the robustness of a dermatology AI model and whether it generalises from telemedicine cases to a new setting comprising both patient-submitted photographs ("PAT") and photographs taken in-clinic by clinicians ("CLIN").
We conducted a retrospective cohort study involving 2500 cases previously unseen by the AI model, including both PAT and CLIN cases, from 22 clinics in the San Francisco Bay Area, spanning November 2015 to January 2021. The primary outcome measure for both the AI model and dermatologists was top-3 accuracy: whether their top three differential diagnoses contained the per-case reference diagnosis established by a panel of dermatologists.
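The top-3 accuracy metric described above can be sketched in a few lines; this is a minimal illustration of the definition, with hypothetical condition names, not code from the study.

```python
def top3_accuracy(predictions, reference_diagnoses):
    """Fraction of cases whose reference diagnosis appears in the
    ranked top-3 differential.

    predictions: list of ranked differential-diagnosis lists, one per case.
    reference_diagnoses: the panel's reference diagnosis, one per case.
    """
    hits = sum(
        ref in preds[:3]
        for preds, ref in zip(predictions, reference_diagnoses)
    )
    return hits / len(reference_diagnoses)


# Illustrative usage with made-up cases:
preds = [
    ["acne vulgaris", "eczema", "psoriasis"],
    ["melanoma", "naevus", "seborrhoeic keratosis"],
]
print(top3_accuracy(preds, ["eczema", "verruca"]))  # one of two cases hit
```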
The AI performed similarly on PAT and CLIN images (74% top-3 accuracy on CLIN vs. 71% on PAT); dermatologists, however, were more accurate on PAT images (79% on CLIN vs. 87% on PAT). We demonstrate that demographic factors were not associated with AI or dermatologist errors; instead, several categories of conditions were associated with AI model errors (p < 0.05). Resampling CLIN and PAT cases to match the skin-condition distribution of the AI development dataset reduced the observed differences (AI: 84% CLIN vs. 79% PAT; dermatologists: 77% CLIN vs. 89% PAT). We demonstrate a series of steps to close the generalisation gap, each requiring progressively more information about the new dataset, ranging from the condition distribution to additional training data for rarer conditions. When using additional training data and testing on the dataset without resampling to match AI development, end-to-end fine-tuning of the AI model (85% CLIN vs. 83% PAT) performed comparably to fine-tuning only the classification layer on top of a frozen embedding model (86% CLIN vs. 84% PAT).
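One way to read the resampling step above is as importance weighting: each case's correctness is weighted by the ratio of its condition's frequency in the development dataset to its frequency in the new setting. The sketch below illustrates that idea with invented condition names and frequencies; it is an assumption about the general technique, not the paper's exact procedure.

```python
from collections import Counter


def reweighted_accuracy(correct, conditions, target_dist):
    """Accuracy after reweighting cases so the condition mix matches
    a target (e.g. development-set) distribution.

    correct: 1/0 per case (top-3 hit or miss).
    conditions: the condition label per case.
    target_dist: condition -> probability in the target distribution.
    """
    counts = Counter(conditions)
    n = len(conditions)
    # Importance weight per case: target frequency / observed frequency
    # of that case's condition.
    weights = [target_dist[c] / (counts[c] / n) for c in conditions]
    return sum(w * c for w, c in zip(weights, correct)) / sum(weights)


# Illustrative usage: a set skewed towards acne, reweighted to a
# hypothetical 50/50 development distribution.
correct = [1, 1, 0, 1]
conditions = ["acne", "acne", "acne", "melanoma"]
print(reweighted_accuracy(correct, conditions, {"acne": 0.5, "melanoma": 0.5}))
```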
AI algorithms can be efficiently adapted to new settings without additional training data by recalibrating the existing model, or with targeted data acquisition for rarer conditions and retraining just the final layer.
Google.