Gao Junyu, Zhang Tianzhu, Xu Changsheng
IEEE Trans Pattern Anal Mach Intell. 2021 Oct;43(10):3476-3491. doi: 10.1109/TPAMI.2020.2985708. Epub 2021 Sep 2.
With the explosive growth of video categories, zero-shot learning (ZSL) for video classification has become a promising research direction in pattern analysis and machine learning. Given auxiliary information such as word embeddings and attributes, the key to a robust ZSL method is transferring knowledge learned on seen classes to unseen classes, which requires modeling the relationships between these concepts (e.g., categories and attributes). However, most existing approaches do not model these relationships explicitly in an end-to-end manner, which limits the effectiveness of knowledge transfer. To tackle this problem, we recast video ZSL as a task-driven message passing process that jointly offers several benefits: a reduced heterogeneity gap, low domain shift, and robust temporal modeling. Specifically, we propose a prototype-sample graph neural network (PS-GNN), consisting of a prototype branch and a sample branch, to directly and adaptively model all category-attribute, category-category, and attribute-attribute relationships. The prototype branch takes as input a set of word-embedding vectors corresponding to the concepts and aims to learn robust representations of video categories. The sample branch generates features for a video sample by leveraging its object semantics. Through co-adaptation and cooperation between the two branches, a unified and robust ZSL framework is achieved. Extensive experiments show that PS-GNN consistently achieves favorable performance on five popular video benchmarks.
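The two-branch idea can be illustrated with a minimal sketch. This is not the PS-GNN architecture, which the abstract does not specify in detail. It assumes a single generic graph-convolution-style update, an adjacency matrix derived from cosine similarity of word embeddings, random vectors standing in for real embeddings, and hypothetical object-detection scores for one video. All function and variable names (`message_passing`, `classify`, `object_scores`, and so on) are illustrative only.

```python
import numpy as np


def row_softmax(x):
    """Softmax over each row, shifted for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def message_passing(node_feats, weight):
    """One generic message-passing step over all concept nodes.

    Assumption: edge strengths come from a softmax over cosine
    similarities, so category-category, category-attribute and
    attribute-attribute relations are all covered by one matrix.
    """
    unit = l2_normalize(node_feats)
    adjacency = row_softmax(unit @ unit.T)
    updated = np.maximum(adjacency @ node_feats @ weight, 0.0)  # ReLU
    return updated, adjacency


def classify(sample_feat, prototypes):
    """Return the index of the prototype with highest cosine similarity."""
    scores = l2_normalize(prototypes) @ l2_normalize(sample_feat)
    return int(np.argmax(scores)), scores


rng = np.random.default_rng(0)
num_categories, num_attributes, dim = 3, 5, 8

# Placeholder word embeddings (random here; real ones would come
# from a pretrained word-embedding model).
category_emb = rng.normal(size=(num_categories, dim))
attribute_emb = rng.normal(size=(num_attributes, dim))
weight = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # untrained weights

# Prototype branch: refine category embeddings via message passing.
nodes = np.vstack([category_emb, attribute_emb])
updated, adjacency = message_passing(nodes, weight)
prototypes = updated[:num_categories]
attribute_nodes = updated[num_categories:]

# Sample branch: describe a video by the objects detected in it
# (hypothetical detector scores), pooled over attribute nodes.
object_scores = row_softmax(rng.normal(size=(1, num_attributes)))[0]
sample_feat = object_scores @ attribute_nodes

predicted, scores = classify(sample_feat, prototypes)
print("predicted category index:", predicted)
```

Because the sample feature and the category prototypes are built from the same refined concept nodes, they live in one space and can be compared directly. That is one simple way to read the "alleviated heterogeneity gap" the abstract mentions. A trained model would learn `weight` end-to-end on seen classes.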