Zhu Guangming, Zhang Liang, Yang Lu, Mei Lin, Shah Syed Afaq Ali, Bennamoun Mohammed, Shen Peiyi
IEEE Trans Neural Netw Learn Syst. 2019 Jun 28. doi: 10.1109/TNNLS.2019.2919764.
Convolutional long short-term memory (ConvLSTM) networks have been widely used for action/gesture recognition, and different attention mechanisms have also been embedded into ConvLSTM networks. This paper explores the redundancy of the spatial convolutions and the effects of the attention mechanism in ConvLSTM, based on our previous gesture recognition architectures that combine the 3-D convolutional neural network (CNN) and ConvLSTM. Depthwise separable, group, and shuffle convolutions are used to replace the convolutional structures in ConvLSTM for the redundancy analysis. In addition, four ConvLSTM variants are derived for the attention analysis: 1) by removing the convolutional structures of the three gates in ConvLSTM; 2) by applying the attention mechanism on the ConvLSTM input; and 3) by reconstructing the input gate and 4) the output gate with the modified channelwise attention mechanism. Evaluation results demonstrate that the spatial convolutions in the three gates scarcely contribute to the spatiotemporal feature fusion and that the attention mechanisms embedded into the input and output gates cannot improve the feature fusion. In other words, when taking spatial or spatiotemporal features as input, ConvLSTM mainly contributes to the temporal fusion along the recurrent steps to learn long-term spatiotemporal features. On this basis, a new LSTM variant is derived in which the convolutional structures are embedded only into the input-to-state transition of LSTM. The code of the LSTM variants is publicly available at https://github.com/GuangmingZhu/ConvLSTMForGR.
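The derived variant can be illustrated with a minimal sketch: only the input-to-state transition keeps a convolution, while the gates fall back to plain fully connected transforms. This is an illustration of the idea stated in the abstract, not the authors' released implementation; the gate layout (fully connected gates over globally pooled features) and all names (`InputConvLSTMCell`, `conv_xg`, `fc_gates`) are hypothetical simplifications. The exact design is in the linked repository.

```python
import torch
import torch.nn as nn

class InputConvLSTMCell(nn.Module):
    """Sketch of an LSTM cell with convolution only in the
    input-to-state transition, per the abstract. The fully connected
    gates over pooled features are an assumption for illustration."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # Input-to-state transition: the only convolutional structure.
        self.conv_xg = nn.Conv2d(in_channels, hidden_channels,
                                 kernel_size, padding=kernel_size // 2)
        # Gates i, f, o: plain fully connected transforms, no spatial
        # convolutions, computed jointly from pooled descriptors.
        self.fc_gates = nn.Linear(in_channels + hidden_channels,
                                  3 * hidden_channels)

    def forward(self, x, state):
        h, c = state                        # h, c: (B, hidden, H, W)
        # Channel descriptors via global average pooling.
        x_vec = x.mean(dim=(2, 3))          # (B, in_channels)
        h_vec = h.mean(dim=(2, 3))          # (B, hidden_channels)
        gates = self.fc_gates(torch.cat([x_vec, h_vec], dim=1))
        i, f, o = torch.sigmoid(gates).chunk(3, dim=1)
        # Broadcast the gate vectors over the spatial dimensions.
        i, f, o = (g[..., None, None] for g in (i, f, o))
        # Convolutional input-to-state transition.
        g = torch.tanh(self.conv_xg(x))
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

# Usage: one recurrent step on a batch of 64-channel feature maps.
cell = InputConvLSTMCell(in_channels=64, hidden_channels=64)
x = torch.randn(2, 64, 28, 28)
h = torch.zeros(2, 64, 28, 28)
c = torch.zeros(2, 64, 28, 28)
h, c = cell(x, (h, c))
print(h.shape)  # torch.Size([2, 64, 28, 28])
```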