Object-Centric Representation Learning for Video Scene Understanding.

Author information

Zhou Yi, Zhang Hui, Park Seung-In, Yoo ByungIn, Qi Xiaojuan

Publication information

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8410-8423. doi: 10.1109/TPAMI.2024.3401409. Epub 2024 Nov 6.

Abstract

Depth-aware Video Panoptic Segmentation (DVPS) is a challenging task that requires predicting the semantic class and 3D depth of each pixel in a video, while also segmenting and consistently tracking objects across frames. Predominant methodologies treat this as a multi-task learning problem, tackling each constituent task independently, thus restricting their capacity to leverage interrelationships amongst tasks and requiring parameter tuning for each task. To surmount these constraints, we present Slot-IVPS, a new approach employing an object-centric model to acquire unified object representations, thereby facilitating the model's ability to simultaneously capture semantic and depth information. Specifically, we introduce a novel representation, Integrated Panoptic Slots (IPS), to capture both semantic and depth information for all panoptic objects within a video, encompassing background semantics and foreground instances. Subsequently, we propose an integrated feature generator and enhancer to extract depth-aware features, alongside the Integrated Video Panoptic Retriever (IVPR), which iteratively retrieves spatial-temporal coherent object features and encodes them into IPS. The resulting IPS can be effortlessly decoded into an array of video outputs, including depth maps, classifications, masks, and object instance IDs. We undertake comprehensive analyses across four datasets, attaining state-of-the-art performance in both Depth-aware Video Panoptic Segmentation and Video Panoptic Segmentation tasks.
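
The abstract does not give implementation details for IVPR or IPS, so the following is only a toy illustration of the general idea of slots iteratively retrieving features and then being decoded into masks. It assumes a generic slot-attention-style update (softmax competition between slots, followed by an attention-weighted mean) on plain Python lists. The function names `retrieve` and `decode_masks`, the feature vectors, and the iteration count are all hypothetical and are not taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(slots, features, iterations=3):
    """Hypothetical stand-in for iterative retrieval (not the paper's IVPR).

    slots:    K vectors of length D (one per object or background region)
    features: N vectors of length D (one per pixel or patch)
    Returns the refined slots and the attention from the last iteration.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    attn = []
    for _ in range(iterations):
        # Softmax over slots for each feature, so slots compete for pixels.
        attn = [softmax([scale * dot(s, f) for s in slots]) for f in features]
        new_slots = []
        for k in range(len(slots)):
            weights = [row[k] for row in attn]
            total = sum(weights) + 1e-8
            # Each slot becomes the attention-weighted mean of the features.
            new_slots.append([
                sum(w * f[d] for w, f in zip(weights, features)) / total
                for d in range(dim)
            ])
        slots = new_slots
    return slots, attn

def decode_masks(attn):
    """Assign each feature to the slot with the highest attention weight."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in attn]

if __name__ == "__main__":
    # Two made-up feature clusters and two initial slots (assumed values).
    features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    init_slots = [[1.0, 0.0], [0.0, 1.0]]
    slots, attn = retrieve(init_slots, features)
    print(decode_masks(attn))  # prints [0, 0, 1, 1]
```

In this sketch a single slot vector drives the mask assignment. In the approach described above, the same slot would additionally be decoded into a class, a depth map, and an instance ID, which the toy example leaves out.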
