Wu Ranwan, Xiang Tian-Zhu, Xie Guo-Sen, Gao Rongrong, Shu Xiangbo, Zhao Fang, Shao Ling
IEEE Trans Image Process. 2025;34:5341-5354. doi: 10.1109/TIP.2025.3587579.
Referring camouflaged object detection (Ref-COD) is a recently proposed task that aims to segment specified camouflaged objects by leveraging a visual reference, i.e., a small set of referring images containing salient target objects. Ref-COD poses a considerable challenge because camouflaged objects are difficult to discern from their highly similar backgrounds, and because their features differ significantly from those of the provided visual reference. To tackle these challenges, we propose a novel uncertainty-aware transformer for the Ref-COD task, termed UAT. UAT first uses a cross-attention mechanism to align and integrate the visual reference so as to guide camouflaged feature learning, and then models the dependencies between patches in a probabilistic manner to learn predictive uncertainty and mine discriminative camouflaged features. Specifically, we first design a referring feature aggregation (RFA) module that aligns and incorporates referring features with camouflaged features, guiding target-specific feature learning within the feature space of camouflaged images. Then, to enhance multi-level feature extraction, we develop a cross-attention encoder (CAE) that integrates global information and multi-scale semantics between adjacent layers to mine critical camouflage cues. More importantly, we propose a transformer probabilistic decoder (TPD) that models the dependencies between patches as Gaussian random variables to capture uncertainty-aware camouflaged features. Extensive experiments on the gold-standard Ref-COD benchmark demonstrate the superiority of UAT over existing state-of-the-art competitors. The proposed UAT also achieves competitive performance on several conventional COD datasets, further demonstrating its scalability. The source code is available at https://github.com/CVL-hub/UAT.
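To make the pipeline described above concrete, the minimal PyTorch sketch below illustrates two ideas from the abstract: reference-guided cross-attention, in which camouflaged-image tokens attend to referring-image features (the RFA idea), and modeling patch features as Gaussian random variables sampled with the reparameterization trick (the TPD idea). All module names, dimensions, and design details here are illustrative assumptions, not the authors' implementation; the official code is at https://github.com/CVL-hub/UAT.

# Hypothetical sketch (not the authors' code): reference-guided cross-attention
# followed by a Gaussian patch layer, loosely following the abstract's description.
import torch
import torch.nn as nn


class ReferenceCrossAttention(nn.Module):
    """Camouflaged patch tokens attend to referring-image tokens (RFA-like idea)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, camo_tokens, ref_tokens):
        # Queries come from the camouflaged image; keys/values from the reference set.
        fused, _ = self.attn(camo_tokens, ref_tokens, ref_tokens)
        return self.norm(camo_tokens + fused)


class GaussianPatchLayer(nn.Module):
    """Treat each patch embedding as a Gaussian and sample via the reparameterization
    trick, so predictive uncertainty over patch dependencies can be learned (TPD-like idea)."""

    def __init__(self, dim):
        super().__init__()
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)

    def forward(self, tokens):
        mu, logvar = self.to_mu(tokens), self.to_logvar(tokens)
        std = torch.exp(0.5 * logvar)
        sample = mu + std * torch.randn_like(std)  # stochastic sample of patch features
        return sample, mu, logvar


if __name__ == "__main__":
    B, N_CAMO, N_REF, D = 2, 196, 64, 256
    camo = torch.randn(B, N_CAMO, D)   # patch tokens of the camouflaged image
    ref = torch.randn(B, N_REF, D)     # aggregated tokens from the referring images
    fused = ReferenceCrossAttention(D)(camo, ref)
    sampled, mu, logvar = GaussianPatchLayer(D)(fused)
    print(sampled.shape)               # torch.Size([2, 196, 256])

In a full model, the mean and log-variance would typically also feed a regularization term (e.g., a KL divergence), and sampling would be replaced by the mean at inference time; those choices are assumptions here, not details taken from the paper.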