Suppr 超能文献



Learnable Feature Augmentation Framework for Temporal Action Localization.

Authors

Tang Yepeng, Wang Weining, Zhang Chunjie, Liu Jing, Zhao Yao

Publication

IEEE Trans Image Process. 2024;33:4002-4015. doi: 10.1109/TIP.2024.3413599. Epub 2024 Jun 28.

DOI: 10.1109/TIP.2024.3413599
PMID: 38889016
Abstract

Temporal action localization (TAL) has drawn much attention in recent years; however, the performance of previous methods is still far from satisfactory due to the lack of annotated untrimmed video data. To deal with this issue, we propose to improve the utilization of current data through feature augmentation. Given an input video, we first extract video features with pre-trained video encoders, and then randomly mask various semantic contents of the video features to consider different views of them. To avoid damaging important action-related semantic information, we further develop a learnable feature augmentation framework to generate better views of videos. In particular, a Mask-based Feature Augmentation Module (MFAM) is proposed. The MFAM has three advantages: 1) it captures the temporal and semantic relationships of the original video features, 2) it generates masked features that retain indispensable action-related information, and 3) it randomly recycles some masked information to ensure diversity. Finally, we input the masked features and the original features into shared action detectors respectively, and perform action classification and localization jointly for model learning. The proposed framework can improve the robustness and generalization of action detectors by learning more and better views of videos. In the testing stage, the MFAM can be removed, so it brings no extra computational cost. Extensive experiments are conducted on four TAL benchmark datasets. Our proposed framework significantly improves different TAL models and achieves state-of-the-art performance.
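The paper's MFAM learns which feature entries to mask; as a rough illustration of the non-learnable idea it builds on — random masking of feature entries followed by randomly "recycling" (restoring) a small fraction of the masked entries for diversity — here is a minimal NumPy sketch. The function name and the ratio values are hypothetical, not taken from the paper.

```python
import numpy as np

def mask_augment(features, mask_ratio=0.5, recycle_ratio=0.1, rng=None):
    """Zero out random entries of a (T, D) feature map, then restore
    ('recycle') a random fraction of the masked entries."""
    rng = np.random.default_rng(rng)
    masked = rng.random(features.shape) < mask_ratio            # True = drop
    recycled = masked & (rng.random(features.shape) < recycle_ratio)
    keep = ~masked | recycled                                   # surviving entries
    return features * keep

# Example: a toy clip of 8 temporal snippets with 4-dim features.
feats = np.ones((8, 4))
view = mask_augment(feats, rng=0)
```

In the framework described above, such a masked view and the original features would both be fed through shared action detectors, with classification and localization losses applied jointly; at test time the augmentation is simply skipped.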


Similar Articles

1
Video Person Re-identification by Temporal Residual Learning.
IEEE Trans Image Process. 2018 Oct 29. doi: 10.1109/TIP.2018.2878505.
2
Semantic and Temporal Contextual Correlation Learning for Weakly-Supervised Temporal Action Localization.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12427-12443. doi: 10.1109/TPAMI.2023.3287208. Epub 2023 Sep 5.
3
Graph Convolutional Module for Temporal Action Localization in Videos.
IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):6209-6223. doi: 10.1109/TPAMI.2021.3090167. Epub 2022 Sep 14.
4
AdapNet: Adaptability Decomposing Encoder-Decoder Network for Weakly Supervised Action Recognition and Localization.
IEEE Trans Neural Netw Learn Syst. 2023 Apr;34(4):1852-1863. doi: 10.1109/TNNLS.2019.2962815. Epub 2023 Apr 4.
5
FineAction: A Fine-Grained Video Dataset for Temporal Action Localization.
IEEE Trans Image Process. 2022;31:6937-6950. doi: 10.1109/TIP.2022.3217368. Epub 2022 Nov 8.
6
Multimodal and multiscale feature fusion for weakly supervised video anomaly detection.
Sci Rep. 2024 Oct 1;14(1):22835. doi: 10.1038/s41598-024-73462-0.
7
A Temporal-Aware Relation and Attention Network for Temporal Action Localization.
IEEE Trans Image Process. 2022;31:4746-4760. doi: 10.1109/TIP.2022.3182866. Epub 2022 Jul 14.
8
StochasticFormer: Stochastic Modeling for Weakly Supervised Temporal Action Localization.
IEEE Trans Image Process. 2023;32:1379-1389. doi: 10.1109/TIP.2023.3244411. Epub 2023 Feb 23.
9
Semisupervised feature selection via spline regression for video semantic recognition.
IEEE Trans Neural Netw Learn Syst. 2015 Feb;26(2):252-64. doi: 10.1109/TNNLS.2014.2314123.