CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection.

Author information

Pang Youwei, Zhao Xiaoqi, Zhang Lihe, Lu Huchuan

Publication information

IEEE Trans Image Process. 2023;32:892-904. doi: 10.1109/TIP.2023.3234702. Epub 2023 Jan 23.

Abstract

Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interwoven fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components.
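The abstract does not spell out how the patch-wise token re-embedding works, so the following is only a minimal NumPy sketch of the general idea it names: grouping pixels into patch tokens with a pure reshape (no learned parameters) so that attention runs over fewer tokens. The function names (patch_reembed, patch_restore, cross_attention), the patch size p, and the omission of learned query/key/value projections are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def patch_reembed(x, p):
    """Group each p x p patch of an (H, W, C) map into one token of size p*p*C.
    Pure reshape/transpose: no learned parameters (illustrative assumption)."""
    h, w, c = x.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by p"
    x = x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape((h // p) * (w // p), p * p * c)

def patch_restore(tokens, h, w, c, p):
    """Inverse of patch_reembed: patch tokens back to an (H, W, C) map."""
    x = tokens.reshape(h // p, w // p, p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_map, kv_map, p):
    """Hypothetical cross-modal attention: queries from one modality,
    keys/values from the other, both re-embedded as patch tokens."""
    h, w, c = q_map.shape
    q = patch_reembed(q_map, p)
    k = v = patch_reembed(kv_map, p)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))  # (N/p^2, N/p^2) instead of (N, N)
    return patch_restore(attn @ v, h, w, c, p), attn

rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 8, 4))    # stand-in for RGB features
depth = rng.standard_normal((8, 8, 4))  # stand-in for depth or thermal features
fused, attn = cross_attention(rgb, depth, p=2)
print(fused.shape, attn.shape)
```

With N = H x W pixel tokens, the attention matrix in this sketch has (N / p^2)^2 entries rather than N^2; for the 8 x 8 map and p = 2 above, that is a 16 x 16 matrix instead of 64 x 64.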

