Fernandes Fara A, Ge Mouzhi, Chaltikyan Georgi, Gerdes Martin W, Omlin Christian W
Department of Information and Communication Technology, University of Agder (UiA), 4879 Grimstad, Norway.
Faculty European Campus Rottal-Inn, Deggendorf Institute of Technology (DIT), 84347 Pfarrkirchen, Germany.
Dentomaxillofac Radiol. 2025 Feb 1;54(2):149-162. doi: 10.1093/dmfr/twae056.
To compare the performance of the convolutional neural network (CNN) with that of the vision transformer (ViT) and the gated multilayer perceptron (gMLP) in the classification of radiographic images of dental structures.
Retrospectively collected two-dimensional images derived from cone beam computed tomographic volumes were used to train CNN, ViT, and gMLP architectures as classifiers for four different cases. The cases selected for training the architectures were the classification of the radiographic appearance of the maxillary sinuses, of the maxillary and mandibular incisors, the presence or absence of the mental foramen, and the positional relationship of the mandibular third molar to the inferior alveolar nerve canal. The performance metrics (sensitivity, specificity, precision, accuracy, and F1-score) and the areas under the receiver operating characteristic and precision-recall curves (AUC) were calculated.
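The confusion-matrix-based metrics listed above can be sketched in a few lines of numpy; the labels and predictions below are hypothetical toy values, not data from the study:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, precision, accuracy, and F1-score
    derived from the binary confusion matrix (positive class = 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    sensitivity = tp / (tp + fn)   # recall / true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, accuracy=accuracy, f1=f1)

# toy example with hypothetical ground truth and model output
m = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
```

For the multi-class tasks (e.g. the third-molar positional relationship), these per-class metrics would typically be computed one-vs-rest and then averaged.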
The ViT, with an accuracy of 0.74-0.98, performed on par with the CNN model (accuracy 0.71-0.99) in all tasks. The gMLP displayed marginally lower performance (accuracy 0.65-0.98) than the CNN and ViT. For certain tasks, the ViT outperformed the CNN. Across the four cases, the AUCs ranged from 0.77 to 1.00 (CNN), 0.80 to 1.00 (ViT), and 0.73 to 1.00 (gMLP).
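The AUC-ROC values reported above have a convenient rank-based interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal numpy sketch of that computation, using hypothetical classifier scores rather than the study's data:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC-ROC via the Mann-Whitney U statistic: P(score of a random
    positive > score of a random negative), with ties counted as 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # rank all scores (1-based), averaging ranks over tied values
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# hypothetical labels and predicted probabilities
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 1.00, as reached by all three architectures on some tasks, means every positive case was scored above every negative case.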
The ViT and gMLP exhibited performance comparable to that of the CNN (the current state of the art). However, for certain tasks, there was a significant difference in the performance of the ViT and gMLP compared with the CNN. This task-dependent difference in model performance suggests that the distinct capabilities of different architectures may be leveraged.