Department of Ophthalmology, Cantonal Hospital Lucerne, Lucerne, Switzerland; Medical Retina Department, Moorfields Eye Hospital National Health Service Foundation Trust, London, UK.
National Institute for Health Research Biomedical Research Centre, Moorfields Eye Hospital National Health Service Foundation Trust, and University College London Institute of Ophthalmology, London, UK; Medical Retina Department, Moorfields Eye Hospital National Health Service Foundation Trust, London, UK.
Lancet Digit Health. 2019 Sep;1(5):e232-e242. doi: 10.1016/S2589-7500(19)30108-6. Epub 2019 Sep 5.
Deep learning has the potential to transform health care; however, substantial expertise is required to train such models. We sought to evaluate the utility of automated deep learning software in enabling health-care professionals with no coding expertise and no deep learning expertise to develop medical image diagnostic classifiers.
We used five publicly available open-source datasets: retinal fundus images (MESSIDOR); optical coherence tomography (OCT) images (Guangzhou Medical University and Shiley Eye Institute, version 3); images of skin lesions (Human Against Machine [HAM] 10000); and both paediatric and adult chest x-ray (CXR) images (Guangzhou Medical University and Shiley Eye Institute, version 3, and the National Institutes of Health [NIH] dataset, respectively). Each dataset was fed separately into a neural architecture search framework, hosted through Google Cloud AutoML, that automatically developed a deep learning architecture to classify common diseases. Sensitivity (recall), specificity, and positive predictive value (precision) were used to evaluate the diagnostic properties of the models, and discriminative performance was assessed with the area under the precision-recall curve (AUPRC). For the deep learning model developed on a subset of the HAM10000 dataset, we did an external validation using the Edinburgh Dermofit Library dataset.
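For illustration only (this is not the authors' pipeline), the following is a minimal sketch of how these evaluation metrics could be computed for a binary classifier with scikit-learn; the labels, predicted scores, and the 0·5 operating threshold are hypothetical assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc

# Hypothetical ground-truth labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.35, 0.9, 0.3, 0.7, 0.2, 0.55, 0.45])
y_pred = (y_score >= 0.5).astype(int)  # assumed operating threshold

# Diagnostic properties at the chosen threshold.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)           # precision

# Discriminative performance: area under the precision-recall curve (threshold-independent).
precision, recall, _ = precision_recall_curve(y_true, y_score)
auprc = auc(recall, precision)

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"PPV={ppv:.2f}, AUPRC={auprc:.2f}")
```

Here AUPRC is taken as the trapezoidal area under the precision-recall curve; sklearn.metrics.average_precision_score is a common alternative estimator of the same quantity.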
Diagnostic properties and discriminative performance from internal validations were high in the binary classification tasks (sensitivity 73·3-97·0%; specificity 67-100%; AUPRC 0·87-1·00). In the multiple classification tasks, the diagnostic properties ranged from 38% to 100% for sensitivity and from 67% to 100% for specificity. The discriminative performance in terms of AUPRC ranged from 0·57 to 1·00 in the five automated deep learning models. In an external validation using the Edinburgh Dermofit Library dataset, the automated deep learning model showed an AUPRC of 0·47, with a sensitivity of 49% and a positive predictive value of 52%.
All models, except the automated deep learning model trained on the multilabel classification task of the NIH CXR14 dataset, showed discriminative performance and diagnostic properties comparable to state-of-the-art deep learning algorithms. Performance in the external validation study was low. The quality of the open-access datasets (including insufficient information about patient flow and demographics) and the absence of measures of precision, such as confidence intervals, were the major limitations of this study. The availability of automated deep learning platforms provides an opportunity for the medical community to enhance its understanding of model development and evaluation. Although the derivation of classification models without requiring a deep understanding of the mathematical, statistical, and programming principles is attractive, comparable performance to expertly designed models is limited to more elementary classification tasks. Furthermore, care should be taken to adhere to ethical principles when using these automated models, to avoid discrimination and harm. Future studies should compare several application programming interfaces on thoroughly curated datasets.
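To illustrate the kind of precision measure the study identifies as missing, here is a minimal sketch, using the same hypothetical labels and predictions as above, of how a bootstrap 95% confidence interval for sensitivity could be reported; the resample count and arrays are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0])

def sensitivity(t, p):
    tp = np.sum((t == 1) & (p == 1))
    fn = np.sum((t == 1) & (p == 0))
    return tp / (tp + fn)

boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))  # resample cases with replacement
    if y_true[idx].sum() == 0:                       # skip resamples with no positives
        continue
    boot.append(sensitivity(y_true[idx], y_pred[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])            # percentile bootstrap interval
print(f"sensitivity={sensitivity(y_true, y_pred):.2f} (95% CI {lo:.2f}-{hi:.2f})")
```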
National Institute for Health Research and Moorfields Eye Charity.