Department of Dermatology, I Dermatology Clinic, Seoul, Korea.
Department of Dermatology, Severance Hospital, Yonsei University College of Medicine, Seoul, Korea.
PLoS Med. 2020 Nov 25;17(11):e1003381. doi: 10.1371/journal.pmed.1003381. eCollection 2020 Nov.
The diagnostic performance of convolutional neural networks (CNNs) for several types of skin neoplasms has been shown to be comparable with that of dermatologists when clinical photographs are used. However, generalizability should be demonstrated using a large-scale external dataset that includes most types of skin neoplasms. In this study, the performance of a neural network algorithm was compared with that of dermatologists in both real-world practice and experimental settings.
To demonstrate generalizability, the skin cancer detection algorithm (https://rcnn.modelderm.com) developed in our previous study was used without modification. We conducted a retrospective study of all single-lesion biopsied cases (43 disorders; 40,331 clinical images from 10,426 cases: 1,222 malignant and 9,204 benign; mean age [standard deviation (SD)], 52.1 [18.3] years; 4,701 men [45.1%]) obtained from the Department of Dermatology, Severance Hospital in Seoul, Korea, between January 1, 2008 and March 31, 2019. Using this external validation dataset, the predictions of the algorithm were compared with the clinical diagnoses recorded by 65 attending physicians after thorough examination in real-world practice. In addition, the results of the algorithm for randomly selected batches of 30 patients were compared with those of 44 dermatologists in experimental settings; the dermatologists were provided only with multiple images of each lesion, without clinical information. For the determination of malignancy, the area under the curve (AUC) achieved by the algorithm was 0.863 (95% confidence interval [CI] 0.852-0.875) when unprocessed clinical photographs were used. The sensitivity and specificity of the algorithm at the predefined high-specificity threshold were 62.7% (95% CI 59.9-65.1) and 90.0% (95% CI 89.4-90.6), respectively. The sensitivity and specificity of the first clinical impression of the 65 attending physicians were 70.2% and 95.6%, respectively, which were superior to those of the algorithm (McNemar test; p < 0.0001). The positive and negative predictive values of the algorithm were 45.4% (95% CI 43.7-47.3) and 94.8% (95% CI 94.4-95.2), respectively, whereas those of the first clinical impression were 68.1% and 96.0%, respectively.
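The relationship between the reported sensitivity, specificity, and predictive values can be illustrated with a minimal sketch. It assumes the predictive values follow from Bayes' rule applied to the malignancy prevalence in the validation set (1,222 of 10,426 cases); the function name `predictive_values` is hypothetical and is not part of the study's software. Because the inputs are the rounded percentages from the abstract, the outputs are close to the reported figures but may differ slightly in the last digit.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test at a given disease prevalence.

    All arguments are proportions in [0, 1].
    """
    # Expected fraction of the population in each cell of the 2x2 table.
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)

    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Prevalence of malignancy in the external validation dataset.
prevalence = 1222 / 10426

# Rounded sensitivity and specificity as reported in the abstract.
algo_ppv, algo_npv = predictive_values(0.627, 0.900, prevalence)
clin_ppv, clin_npv = predictive_values(0.702, 0.956, prevalence)

print(f"Algorithm:        PPV={algo_ppv:.1%}, NPV={algo_npv:.1%}")
print(f"First impression: PPV={clin_ppv:.1%}, NPV={clin_npv:.1%}")
```

At a prevalence of roughly 12%, the gap in specificity (90.0% versus 95.6%) accounts for most of the difference in positive predictive value, since false positives are drawn from the much larger benign group.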
In the reader test conducted using images corresponding to batches of 30 patients, the sensitivity and specificity of the algorithm at the predefined threshold were 66.9% (95% CI 57.7-76.0) and 87.4% (95% CI 82.5-92.2), respectively. Furthermore, the sensitivity and specificity derived from the first impression of 44 of the participants were 65.8% (95% CI 55.7-75.9) and 85.7% (95% CI 82.4-88.9), respectively, which are values comparable with those of the algorithm (Wilcoxon signed-rank test; p = 0.607 and 0.097). Limitations of this study include the exclusive use of high-quality clinical photographs taken in hospitals and the lack of ethnic diversity in the study population.
Our algorithm could diagnose skin tumors with nearly the same accuracy as a dermatologist when the diagnosis was performed solely with photographs. However, as a result of limited data relevancy, the performance was inferior to that of actual medical examination. To achieve more accurate predictive diagnoses, clinical information should be integrated with imaging information.