PiGLET: Pixel-Level Grounding of Language Expressions With Transformers

Authors

Gonzalez Cristina, Ayobi Nicolas, Hernandez Isabela, Pont-Tuset Jordi, Arbelaez Pablo

Publication information

IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12206-12221. doi: 10.1109/TPAMI.2023.3286760. Epub 2023 Sep 5.

DOI:10.1109/TPAMI.2023.3286760

Abstract

This paper proposes Panoptic Narrative Grounding, a spatially fine and general formulation of the natural language visual grounding problem. We establish an experimental framework for the study of this new task, including new ground truth and metrics. We propose PiGLET, a novel multi-modal Transformer architecture to tackle the Panoptic Narrative Grounding task, and to serve as a stepping stone for future work. We exploit the intrinsic semantic richness in an image by including panoptic categories, and we approach visual grounding at a fine-grained level using segmentations. In terms of ground truth, we propose an algorithm to automatically transfer Localized Narratives annotations to specific regions in the panoptic segmentations of the MS COCO dataset. PiGLET achieves a performance of 63.2 absolute Average Recall points. By leveraging the rich language information on the Panoptic Narrative Grounding benchmark on MS COCO, PiGLET obtains an improvement of 0.4 Panoptic Quality points over its base method on the panoptic segmentation task. Finally, we demonstrate the generalizability of our method to other natural language visual grounding problems such as Referring Expression Segmentation. PiGLET is competitive with previous state-of-the-art in RefCOCO, RefCOCO+ and RefCOCOg.
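The abstract reports results in Average Recall points computed on segmentations. As a rough illustration only, the sketch below scores each grounded phrase by mask IoU and averages recall over a set of IoU thresholds. The function names, the flattened-mask representation, and the evenly spaced thresholds are assumptions made for this example, not the evaluation protocol of the paper.

```python
from typing import Sequence


def mask_iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Intersection over union of two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks have no union; treat that case as IoU 0 (an assumption).
    return intersection / union if union else 0.0


def average_recall(ious: Sequence[float], num_thresholds: int = 11) -> float:
    """Mean recall over evenly spaced IoU thresholds in [0, 1].

    Recall at a threshold is the fraction of phrases whose predicted
    mask reaches at least that IoU with the ground-truth mask.
    The threshold grid is hypothetical, chosen for illustration.
    """
    if not ious:
        return 0.0
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    recalls = [sum(1 for v in ious if v >= t) / len(ious) for t in thresholds]
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Toy 2x2 masks, flattened row by row.
    predicted = [1, 1, 0, 0]
    ground_truth = [1, 0, 1, 0]
    print(mask_iou(predicted, ground_truth))
    print(average_recall([0.5, 1.0], num_thresholds=3))
```

Averaging over thresholds rewards masks that stay accurate under strict overlap requirements, which is why a segmentation-level metric is stricter than a box-level one.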

