Oloruntoba Ayooluwatomiwa I, Vestergaard Tine, Nguyen Toan D, Yu Zhen, Sashindranath Maithili, Betz-Stablein Brigid, Soyer H Peter, Ge Zongyuan, Mar Victoria
School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria, Australia.
Monash Medical Artificial Intelligence, Monash University, Clayton, Melbourne, Australia.
JMIR Dermatol. 2022 Sep 12;5(3):e35150. doi: 10.2196/35150.
Convolutional neural networks (CNNs) are a type of artificial intelligence that shows promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets with varying image capture standardization.
The aim of our study was to use CNN models with the same architecture, trained on image sets acquired with either the same image capture device and technique (standardized) or with varied devices and capture techniques (nonstandardized), and to test variability in performance when classifying skin cancer images in different populations.
In all, 3 CNNs with the same architecture were trained. CNN nonstandardized (CNN-NS) was trained on 25,331 images taken from the International Skin Imaging Collaboration (ISIC) using different image capture devices. CNN standardized (CNN-S) was trained on 177,475 MoleMap images taken with the same capture device, and CNN standardized number 2 (CNN-S2) was trained on a subset of 25,331 standardized MoleMap images (matched for number and classes of training images to CNN-NS). These 3 models were then tested on 3 external test sets: 569 Danish images, the publicly available ISIC 2020 data set consisting of 33,126 images, and The University of Queensland (UQ) data set of 422 images. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Teledermatology assessments available for the Danish data set were used to determine model performance compared to teledermatologists.
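As an illustration of the three primary outcome measures, the following minimal sketch computes sensitivity, specificity, and AUROC from binary labels and model scores. It is not the study's code: the function names, the 0.5 threshold, and the toy data are assumptions made for this example, and the AUROC uses the standard pairwise-ranking (Mann-Whitney) formulation.

```python
def sensitivity_specificity(labels, scores, threshold):
    """labels: 1 = malignant, 0 = benign; a score >= threshold is called malignant."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not from the study).
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
area = auroc(labels, scores)
print(sens, spec, area)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is why the two kinds of measure are reported side by side.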
When tested on the 569 Danish images, the standardized models CNN-S and CNN-S2 achieved AUROCs of 0.861 (95% CI 0.830-0.889) and 0.831 (95% CI 0.798-0.861), respectively. Both outperformed the nonstandardized model CNN-NS (P=.001 and P=.009, respectively), which achieved an AUROC of 0.759 (95% CI 0.722-0.794). When tested on 2 additional data sets (ISIC 2020 and UQ), CNN-S (P<.001 and P<.001, respectively) and CNN-S2 (P=.08 and P=.35, respectively) still outperformed CNN-NS, although the differences for CNN-S2 did not reach statistical significance at the conventional .05 level. When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists on the Danish data set, the models' resultant sensitivities and specificities were surpassed by those of the teledermatologists. However, when compared to CNN-S, the differences were not statistically significant (sensitivity: P=.10; specificity: P=.053). Performance across all CNN models as well as teledermatologists was influenced by image quality.
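The abstract does not spell out how the CNNs were matched to the teledermatologists' operating point. One plausible way to do it, sketched here under that assumption, is to pick the lowest score threshold at which the model's specificity reaches a target value and then read off the model's sensitivity at that threshold. The function names, the 0.75 target, and the toy data are hypothetical.

```python
def threshold_for_specificity(labels, scores, target_specificity):
    """Lowest threshold whose specificity meets the target (score >= threshold => malignant)."""
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # Specificity is non-decreasing in the threshold, so scan candidates upward.
    # The final +inf candidate calls everything benign (specificity 1.0).
    candidates = sorted(set(scores)) + [float("inf")]
    for t in candidates:
        spec = sum(1 for s in negatives if s < t) / len(negatives)
        if spec >= target_specificity:
            return t
    raise ValueError("target_specificity must be at most 1.0")


def sensitivity_at(labels, scores, threshold):
    """Fraction of malignant cases scored at or above the threshold."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    return sum(1 for s in positives if s >= threshold) / len(positives)


# Toy data and a hypothetical reader specificity, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
reader_specificity = 0.75

matched_threshold = threshold_for_specificity(labels, scores, reader_specificity)
matched_sensitivity = sensitivity_at(labels, scores, matched_threshold)
print(matched_threshold, matched_sensitivity)
```

The same procedure can be mirrored to match on sensitivity and compare specificities. Choosing the lowest qualifying threshold gives the model its best achievable sensitivity at the target specificity.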
CNNs trained on standardized images had improved performance and, therefore, greater generalizability in skin cancer classification when applied to unseen data sets. This finding is an important consideration for future algorithm development, regulation, and approval.