IEEE Trans Image Process. 2022;31:4474-4489. doi: 10.1109/TIP.2022.3185487. Epub 2022 Jul 1.
Text-based video segmentation aims to segment an actor in video sequences given a textual query that specifies the actor and the action it performs. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner with respect to the actor and its action, due to the problem of semantic asymmetry. Semantic asymmetry means that the two modalities carry different amounts of semantic information during the multi-modal fusion process. To alleviate this problem, we propose a novel actor and action modular network that localizes the actor and its action individually, in two separate modules. Specifically, we first learn the actor-related and action-related content from the video and the textual query, and then match them in a symmetric manner to localize the target tube. The target tube, which contains the desired actor and action, is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects across multiple frames with the proposed temporal proposal aggregation mechanism, which enables it to segment the video effectively while keeping the predictions temporally consistent. The whole model supports joint learning of actor-action matching and segmentation, and it achieves state-of-the-art performance for both single-frame segmentation and full-video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
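As a rough illustration of the symmetric matching step, the following minimal PyTorch sketch scores each candidate tube by comparing its actor features with the actor embedding of the query and its action features with the action embedding, then selects the best-scoring tube. This is an assumed simplification, not the authors' implementation: the function name match_tubes, the feature shapes, and the summation of the two similarity scores are all hypothetical.

import torch
import torch.nn.functional as F

def match_tubes(video_actor, video_action, text_actor, text_action):
    # video_actor, video_action: (num_tubes, dim) tube-level features.
    # text_actor, text_action:   (dim,) actor/action query embeddings.
    # Match each pair symmetrically: actor features against the actor
    # embedding, action features against the action embedding.
    actor_score = F.cosine_similarity(video_actor, text_actor.unsqueeze(0), dim=1)
    action_score = F.cosine_similarity(video_action, text_action.unsqueeze(0), dim=1)
    score = actor_score + action_score  # combine the two modules (assumed fusion)
    return score.argmax().item(), score

# Toy usage with random features for 5 candidate tubes.
torch.manual_seed(0)
num_tubes, dim = 5, 256
best, scores = match_tubes(
    torch.randn(num_tubes, dim), torch.randn(num_tubes, dim),
    torch.randn(dim), torch.randn(dim),
)
print(f"target tube: {best}, scores: {scores.tolist()}")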
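The abstract does not detail how the temporal proposal aggregation mechanism associates objects across frames. A common baseline for building tubes from per-frame detections is greedy IoU linking between consecutive frames, sketched below under that assumption; iou and link_proposals are hypothetical helpers, not the authors' code.

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_proposals(frames, iou_thresh=0.5):
    # frames: list over time of lists of boxes (x1, y1, x2, y2).
    # Returns a list of tubes, each a list of (frame_index, box).
    tubes = []
    for t, boxes in enumerate(frames):
        for box in boxes:
            # Extend the tube whose box in the previous frame
            # overlaps this proposal the most.
            best, best_iou = None, iou_thresh
            for tube in tubes:
                last_t, last_box = tube[-1]
                if last_t == t - 1:
                    o = iou(last_box, box)
                    if o > best_iou:
                        best, best_iou = tube, o
            if best is not None:
                best.append((t, box))
            else:
                tubes.append([(t, box)])
    return tubes

# Toy usage: one object drifting slowly plus a spurious proposal.
frames = [
    [(0, 0, 10, 10)],
    [(1, 1, 11, 11), (50, 50, 60, 60)],
    [(2, 2, 12, 12)],
]
print(link_proposals(frames))  # one 3-frame tube and one singleton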