Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina.
I-Medata AI Center, Tel Aviv Sourasky Medical Center, Tel Aviv-Yafo, Israel; Department of Pathology, Duke University Medical Center, Durham, North Carolina.
Mod Pathol. 2023 Jun;36(6):100129. doi: 10.1016/j.modpat.2023.100129. Epub 2023 Feb 13.
We examined the performance of deep learning models on the classification of thyroid fine-needle aspiration biopsies using microscope images captured in 2 ways: with a high-resolution scanner and with a mobile phone camera. Our training set consisted of images from 964 whole-slide images captured with a high-resolution scanner. Our test set consisted of 100 slides; 20 manually selected regions of interest (ROIs) from each slide were captured in both ways mentioned above. A baseline machine learning model trained on scanner ROIs deteriorated in performance when applied to the smartphone ROIs (97.8% area under the receiver operating characteristic curve [AUC], CI = [95.4%, 100.0%] for scanner images vs 89.5% AUC, CI = [82.3%, 96.6%] for mobile images, P = .019). Preliminary analysis via histogram matching showed that the baseline model was overly sensitive to slight color variations in the images (specifically, to color differences between mobile and scanner images). Adding color augmentation during training reduced this sensitivity and narrowed the performance gap between mobile and scanner images (97.6% AUC, CI = [95.0%, 100.0%] for scanner images vs 96.0% AUC, CI = [91.8%, 100.0%] for mobile images, P = .309), with both modalities on par with human pathologist performance (95.6% AUC, CI = [91.6%, 99.5%]) for malignancy prediction (P = .398 for pathologist vs scanner and P = .875 for pathologist vs mobile). For indeterminate cases (pathologist-assigned Bethesda category of 3, 4, or 5), color augmentation conferred some improvement (88.3% AUC, CI = [73.7%, 100.0%] for the baseline model vs 96.2% AUC, CI = [90.9%, 100.0%] with color augmentation, P = .158). In addition, we found that our model's performance leveled off after 15 ROIs, a promising indication that ROI data collection would not be time-consuming for our diagnostic system.
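The histogram-matching analysis mentioned above can be sketched as follows. This is a minimal NumPy illustration of standard per-channel CDF matching (mapping one image's color histogram onto a reference image's); the function name and exact procedure are assumptions for illustration, not the authors' code:

```python
import numpy as np

def match_histograms(src, ref):
    """Remap src pixel values so each channel's histogram matches ref's.

    Standard CDF-based histogram matching: each unique source value is
    mapped to the reference value at the same cumulative quantile.
    """
    out = np.empty(src.shape, dtype=float)
    for c in range(src.shape[-1]):
        s = src[..., c].ravel()
        r = ref[..., c].ravel()
        # Unique source values, their positions, and their counts.
        s_vals, s_idx, s_counts = np.unique(
            s, return_inverse=True, return_counts=True)
        r_vals, r_counts = np.unique(r, return_counts=True)
        # Empirical CDFs of source and reference channels.
        s_cdf = np.cumsum(s_counts) / s.size
        r_cdf = np.cumsum(r_counts) / r.size
        # For each source quantile, look up the matching reference value.
        mapped = np.interp(s_cdf, r_cdf, r_vals)
        out[..., c] = mapped[s_idx].reshape(src[..., c].shape)
    return out
```

In the paper's setting, matching a mobile ROI's histogram to its scanner counterpart isolates color as the variable, which is how the baseline model's color sensitivity was probed.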
Finally, we showed that the model produces sensible Bethesda category (TBS) predictions: the observed malignancy rate increased with the predicted TBS category, from 0% malignancy for predicted TBS 2 to 100% for predicted TBS 6.
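The color augmentation used during training can be illustrated with a minimal sketch. The jitter ranges and function below are hypothetical, since the abstract does not specify the augmentation settings; the idea is simply to randomize brightness, contrast, and saturation so the model stops relying on modality-specific color statistics:

```python
import numpy as np

def random_color_jitter(img, rng, brightness=0.25, contrast=0.25,
                        saturation=0.25):
    """Randomly perturb an RGB image in [0, 1] (illustrative ranges)."""
    img = img.astype(float)
    # Brightness: scale all channels by a random factor.
    img = img * rng.uniform(1 - brightness, 1 + brightness)
    # Contrast: blend each pixel with the global mean intensity.
    c = rng.uniform(1 - contrast, 1 + contrast)
    img = img.mean() + c * (img - img.mean())
    # Saturation: blend each pixel with its grayscale value.
    s = rng.uniform(1 - saturation, 1 + saturation)
    gray = img.mean(axis=-1, keepdims=True)
    img = gray + s * (img - gray)
    return np.clip(img, 0.0, 1.0)
```

Applied independently to each training ROI, such perturbations expose the model to a wider range of color conditions than the scanner alone produces, which is consistent with the narrowed scanner-vs-mobile gap reported above.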