Weill Cornell Medicine, New York, NY 10021.
Dalio Institute of Cardiovascular Imaging, Department of Radiology, Weill Cornell Medicine, New York, NY.
Acad Radiol. 2024 Jan;31(1):104-120. doi: 10.1016/j.acra.2023.08.006. Epub 2023 Sep 2.
To develop a deep learning model for the automated classification of breast ultrasound images as benign or malignant. More specifically, the application of vision transformers, ensemble learning, and knowledge distillation to breast ultrasound classification is explored.
Single-view, B-mode ultrasound images were curated from the publicly available Breast Ultrasound Images (BUSI) dataset, which provides categorical ground-truth labels (benign vs. malignant) assigned by radiologists, with malignant cases confirmed by biopsy. The performance of vision transformers (ViTs) is compared to that of convolutional neural networks (CNNs), followed by a comparison among supervised, self-supervised, and randomly initialized ViTs. Subsequently, an ensemble of 10 independently trained ViTs, whose prediction is the unweighted average of each individual model's output, is compared to the performance of each ViT alone. Finally, we train a single ViT to emulate the ensembled ViTs using knowledge distillation.
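As a rough illustration of the ensembling and distillation steps described above, the following PyTorch sketch shows one plausible implementation; the backbone name, checkpoint paths, and the loss weighting `alpha` are assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F
import timm

# Ten independently trained ViTs; "vit_small_patch16_224" and the checkpoint
# paths are illustrative placeholders, not the paper's reported configuration.
N_MODELS = 10
models = [
    timm.create_model("vit_small_patch16_224", pretrained=False, num_classes=2)
    for _ in range(N_MODELS)
]
# for i, m in enumerate(models):
#     m.load_state_dict(torch.load(f"vit_fold_{i}.pt"))  # hypothetical checkpoints
for m in models:
    m.eval()

@torch.no_grad()
def ensemble_probs(x: torch.Tensor) -> torch.Tensor:
    """Ensemble prediction: unweighted average of per-model softmax outputs."""
    probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)  # shape (batch, 2)

def distillation_loss(student_logits: torch.Tensor,
                      x: torch.Tensor,
                      labels: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Soft-target distillation: the student ViT is pushed toward the
    ensemble's averaged probabilities while still fitting the hard labels.
    alpha is an assumed hyperparameter, not reported in the abstract."""
    teacher = ensemble_probs(x)  # soft targets from the ensemble
    soft = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher,
                    reduction="batchmean")
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

Temperature scaling of the soft targets, as in Hinton et al.'s original formulation of knowledge distillation, is a common refinement; this sketch uses T = 1 for simplicity.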
Using five-fold cross-validation on this dataset, ViTs outperform CNNs, and self-supervised ViTs outperform supervised and randomly initialized ViTs. The ensemble model achieves an area under the receiver operating characteristic curve (AuROC) of 0.977 and an area under the precision-recall curve (AuPRC) of 0.965 on the test set, outperforming the average AuROC and AuPRC of the independently trained ViTs (0.958 ± 0.05 and 0.931 ± 0.016, respectively). The distilled ViT achieves an AuROC of 0.972 and an AuPRC of 0.960.
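For reference, the two reported metrics can be computed from test-set predictions as in this minimal sketch; `average_precision_score` is a standard estimator of the area under the precision-recall curve, and the arrays here are toy values for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy values: y_true is 1 for biopsy-confirmed malignant cases, and y_score
# is the model's predicted probability of malignancy on the test set.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.80, 0.90, 0.20, 0.65])

auroc = roc_auc_score(y_true, y_score)            # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # AuPRC estimate
print(f"AuROC = {auroc:.3f}, AuPRC = {auprc:.3f}")
```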
Transfer learning and ensemble learning each offer increased performance independently, and they can be combined sequentially to further improve the final model. Furthermore, a single vision transformer can be trained to match the performance of an ensemble of vision transformers using knowledge distillation.