Oakden-Rayner Lauren, Gale William, Bonham Thomas A, Lungren Matthew P, Carneiro Gustavo, Bradley Andrew P, Palmer Lyle J
School of Public Health, University of Adelaide, Adelaide, SA, Australia; Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia.
Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia; School of Computer Science, University of Adelaide, Adelaide, SA, Australia.
Lancet Digit Health. 2022 May;4(5):e351-e358. doi: 10.1016/S2589-7500(22)00004-8. Epub 2022 Apr 5.
Proximal femoral fractures are an important clinical and public health issue associated with substantial morbidity and early mortality. Artificial intelligence might offer improved diagnostic accuracy for these fractures, but typical approaches to testing of artificial intelligence models can underestimate the risks of artificial intelligence-based diagnostic systems.
We present a preclinical evaluation of a deep learning model intended to detect proximal femoral fractures on frontal x-ray films in emergency department patients, trained on films from the Royal Adelaide Hospital (Adelaide, SA, Australia). This evaluation included: a reader study comparing the performance of the model against five radiologists (three musculoskeletal specialists and two general radiologists) on a dataset of 200 fracture cases and 200 non-fractures (also from the Royal Adelaide Hospital); an external validation study using a dataset obtained from Stanford University Medical Center, CA, USA; and an algorithmic audit to detect any unusual or unexpected model behaviour.
In the reader study, the area under the receiver operating characteristic curve (AUC) for the performance of the deep learning model was 0·994 (95% CI 0·988-0·999) compared with an AUC of 0·969 (0·960-0·978) for the five radiologists. This strong model performance was maintained on external validation, with an AUC of 0·980 (0·931-1·000). However, the preclinical evaluation identified barriers to safe deployment, including a substantial shift in the model operating point on external validation and an increased error rate on cases with abnormal bones (eg, Paget's disease).
The model outperformed the radiologists tested and maintained performance on external validation, but showed several unexpected limitations during further testing. Thorough preclinical evaluation of artificial intelligence models, including algorithmic auditing, can reveal unexpected and potentially harmful behaviour even in high-performance artificial intelligence systems, which can inform future clinical testing and deployment decisions.
Funding: None.