Department of Artificial Intelligence, Hanyang University, Seoul 04763, Korea.
Department of Computer Science and Engineering, Hanyang University, Seoul 04763, Korea.
Sensors (Basel). 2022 Jun 25;22(13):4817. doi: 10.3390/s22134817.
Video captioning via encoder-decoder structures is a successful sentence generation method. In addition, extracting multiple kinds of visual features with several feature extraction networks during encoding is a standard way to improve model performance. Such feature extraction networks are typically convolutional neural networks (CNNs) used with frozen weights. However, these traditional feature extraction methods have several problems. First, because the feature extraction model is frozen, it cannot be further trained by backpropagating the loss obtained from video captioning training; in particular, this prevents the feature extraction model from learning additional spatial information. Second, using multiple CNNs further increases model complexity. Moreover, the authors of the Vision Transformer (ViT) pointed out an inductive bias of CNNs, namely the local receptive field. Therefore, we propose a full transformer structure trained end to end for video captioning to overcome these problems. We use a ViT as the feature extraction model and propose feature extraction gates (FEGs) to enrich the input to the captioning model from the extraction model. In addition, we design a universal encoder attraction (UEA) that takes the outputs of all encoder layers and performs self-attention over them. The UEA addresses the lack of information about the video's temporal relationships, because our method uses only appearance features. We evaluate our model against several recent models on the MSRVTT and MSVD benchmark datasets and show competitive performance. Although the proposed model performs captioning with only a single feature, in some cases it outperforms models that use several features.
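To make the UEA idea concrete, the following is a minimal sketch, assuming a PyTorch implementation: it collects the outputs of every transformer-encoder layer and applies self-attention across those per-layer representations so the decoder can draw on all depths rather than only the final layer. The pooling step, dimensions, and module name are illustrative assumptions, not the authors' exact design.

```python
# Hedged sketch of a "universal encoder attraction" (UEA)-style module:
# self-attention over the per-layer outputs of a transformer encoder.
# Layer count, pooling, and sizes are assumptions for illustration only.
import torch
import torch.nn as nn


class UniversalEncoderAttraction(nn.Module):
    """Self-attention over the outputs of all encoder layers."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, layer_outputs: list) -> torch.Tensor:
        # layer_outputs: list of (batch, tokens, d_model), one tensor per encoder layer.
        # Mean-pool each layer's tokens so every layer contributes one vector,
        # then let the layer representations attend to one another.
        per_layer = torch.stack([h.mean(dim=1) for h in layer_outputs], dim=1)  # (B, L, D)
        attended, _ = self.attn(per_layer, per_layer, per_layer)
        return self.norm(attended + per_layer)  # (B, num_layers, D), passed to the decoder


if __name__ == "__main__":
    # Example: 12 encoder layers over a 4-frame clip flattened into 4 * 197 ViT tokens.
    fake_layers = [torch.randn(2, 4 * 197, 768) for _ in range(12)]
    uea = UniversalEncoderAttraction()
    print(uea(fake_layers).shape)  # torch.Size([2, 12, 768])
```

This sketch only illustrates the aggregation step described in the abstract; how the aggregated representation is combined with the FEG-gated features and consumed by the caption decoder is left to the full paper.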