

GSB: Group superposition binarization for vision transformer with limited training samples.

Affiliations

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China.

State Key Laboratory of Internet of Things for Smart City (SKL-IOTSC), University of Macau, 999078, Macao Special Administrative Region of China; Department of Computer and Information Science (CIS), University of Macau, 999078, Macao Special Administrative Region of China.

Publication information

Neural Netw. 2024 Apr;172:106133. doi: 10.1016/j.neunet.2024.106133. Epub 2024 Jan 18.

Abstract

Vision Transformer (ViT) has performed remarkably in various computer vision tasks. Nonetheless, owing to its massive number of parameters, ViT typically suffers from serious overfitting when the number of training samples is relatively limited. In addition, ViT generally demands heavy computing resources, which limits its deployment on resource-constrained devices. As a type of model-compression method, model binarization is potentially a good choice to solve the above problems. Compared with its full-precision counterpart, a binarized model replaces complex tensor multiplications with simple bit-wise binary operations and represents full-precision parameters and activations with only 1 bit, which addresses the problems of model size and computational complexity, respectively. In this paper, we investigate a binarized ViT model. Empirically, we observe that existing binarization techniques designed for Convolutional Neural Networks (CNNs) do not transfer well to the binarization of ViT. We also find that the accuracy drop of the binarized ViT model stems mainly from information loss in the Attention module and the Value vectors. Therefore, we propose a novel model binarization technique, called Group Superposition Binarization (GSB), to deal with these issues. Furthermore, to further improve the performance of the binarized model, we investigate the gradient calculation procedure in the binarization process and derive more appropriate gradient calculation equations for GSB to reduce the influence of gradient mismatch. Then, the knowledge distillation technique is introduced to alleviate the performance degradation caused by model binarization. Analytically, model binarization limits the parameter search space during parameter updates while training a model. Therefore, the binarization process can actually play an implicit regularization role and help solve the problem of overfitting in the case of insufficient training data. Experiments on three datasets with limited numbers of training samples demonstrate that the proposed GSB model achieves state-of-the-art performance among binary quantization schemes and even exceeds its full-precision counterpart on some metrics. Code and models are available at: https://github.com/IMRL/GSB-Vision-Transformer.
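To make the core idea concrete, the following is a minimal sketch of 1-bit binarization with a scaling factor, and of approximating a tensor by a superposition of several binary bases via residual binarization. This is an illustrative sketch under our own assumptions (the function names `binarize` and `group_superposition_binarize` are ours), not the paper's exact GSB algorithm, which additionally reworks the Attention/Value binarization and the backward gradients.

```python
import numpy as np

def binarize(x):
    """1-bit binarization: alpha * sign(x), where the scalar
    alpha = mean(|x|) minimizes the L2 error of the approximation."""
    alpha = np.mean(np.abs(x))
    return alpha * np.sign(x)

def group_superposition_binarize(x, groups=3):
    """Illustrative residual scheme: approximate x as a sum of
    `groups` binary bases, each one binarizing the residual left
    by the previous bases. More groups -> smaller approximation error,
    while each basis still supports cheap bit-wise arithmetic."""
    approx = np.zeros_like(x)
    for _ in range(groups):
        residual = x - approx
        approx = approx + binarize(residual)
    return approx

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))

# Error of a single binary basis vs. a superposition of three bases.
err1 = np.linalg.norm(x - binarize(x))
err3 = np.linalg.norm(x - group_superposition_binarize(x, groups=3))
print(err1, err3)
```

Each residual step provably cannot increase the L2 error (subtracting `mean(|r|) * sign(r)` from a residual `r` reduces its squared norm by `n * mean(|r|)**2`), so `err3 <= err1`. This mirrors the motivation stated in the abstract: a group superposition retains more information than a single binary basis, which is where plain CNN-style binarization loses accuracy on the Attention module and Value vectors.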

