IEEE Trans Pattern Anal Mach Intell. 2022 Sep;44(9):4761-4775. doi: 10.1109/TPAMI.2021.3079993. Epub 2022 Aug 4.
Given a natural language expression and an image/video, the goal of referring segmentation is to produce pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem through implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem progressively, guided by informative words in the expression, i.e., first roughly locating candidate entities and then distinguishing the target one. In this paper, we propose a cross-modal progressive comprehension (CMPC) scheme to effectively mimic this human behavior and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all candidate entities that the expression might refer to. Relational words are then adopted to highlight the target entity and suppress irrelevant ones via spatial graph reasoning. For video data, our CMPC-V module builds on CMPC-I and further exploits action words to highlight the entity matched with the action cues via temporal graph reasoning. In addition to the CMPC, we introduce a simple yet effective Text-Guided Feature Exchange (TGFE) module to integrate the reasoned multimodal features from different levels of the visual backbone under the guidance of textual information. In this way, multi-level features can communicate with each other and be mutually refined based on the textual context. Combining CMPC-I or CMPC-V with TGFE yields our image and video referring segmentation frameworks, which achieve new state-of-the-art performance on four referring image segmentation benchmarks and three referring video segmentation benchmarks, respectively. Our code is available at https://github.com/spyflying/CMPC-Refseg.
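The two-stage idea described above (first perceive candidate entities with entity/attribute words, then use relational words to reason over a graph of regions) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; all function names, the use of a simple dot-product affinity, and the fixed number of message-passing steps are assumptions for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceive_candidates(visual, entity_words):
    # Stage 1 (hypothetical sketch): score each spatial region against
    # entity/attribute word features to obtain candidate-entity weights.
    # visual: (N, D) region features; entity_words: (T, D) word features.
    scores = visual @ entity_words.T           # (N, T) region-word affinity
    return softmax(scores.max(axis=1))         # (N,) candidate distribution

def relational_reasoning(visual, candidates, rel_word, steps=2):
    # Stage 2 (hypothetical sketch): gate region features with a relational
    # word, build a fully connected region graph, and propagate candidate
    # confidence so relation-consistent regions are highlighted and
    # irrelevant ones suppressed.
    # visual: (N, D); candidates: (N,); rel_word: (D,).
    gated = visual * rel_word                  # word-modulated node features
    adj = softmax(gated @ gated.T, axis=-1)    # (N, N) row-normalized edges
    h = candidates
    for _ in range(steps):
        h = adj @ h                            # graph message passing
    return h / h.sum()                         # renormalized target weights

# Toy usage with random features.
rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 8))       # 5 regions, 8-dim features
entity_words = rng.normal(size=(3, 8)) # 3 entity/attribute words
rel_word = rng.normal(size=(8,))       # 1 relational word
cand = perceive_candidates(visual, entity_words)
target = relational_reasoning(visual, cand, rel_word)
```

The CMPC-V module would extend this by adding action-word gating and edges across frames (temporal graph reasoning), and TGFE would exchange such reasoned features across backbone levels under textual guidance.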