

Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations.

Authors

Lei Cheng, Fan Jie, Li Xinran, Xiang Tian-Zhu, Li Ao, Zhu Ce, Zhang Le

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2025 Sep 10;PP. doi: 10.1109/TPAMI.2025.3600461.

Abstract

Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, "Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?", we propose an affirmative solution. We analyze the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework. Our findings reveal that while transformer models for salient object segmentation (SOS) prioritize global features in their attention mechanisms, camouflaged object segmentation exhibits both global and local attention biases. Based on these findings, we design a framework that adapts to the inherent local pattern bias of COS while incorporating global attention patterns and a broad semantic feature space derived from SOS. This enables efficient zero-shot transfer for COS. Specifically, we incorporate a Masked Image Modeling (MIM)-based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM encoder captures essential local features, while the PEFT module learns global and semantic representations from SOS datasets. To further enhance semantic granularity, we leverage the M-LLM to generate caption embeddings conditioned on visual cues, which are meticulously aligned with multi-scale visual features via MFA. This alignment enables precise interpretation of complex semantic contexts. Moreover, we introduce a learnable codebook to stand in for the M-LLM during inference, significantly reducing computational demands while maintaining performance. Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state-of-the-art performance in zero-shot COS with $F_{\beta}^{w}$ scores of 72.9% on CAMO and 71.7% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Additionally, our method excels in polyp segmentation and underwater scene segmentation, outperforming challenging baselines in both zero-shot and supervised settings, thereby highlighting its potential for broad applicability in diverse segmentation tasks.
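The abstract's efficiency claim hinges on replacing the M-LLM with a learnable codebook at inference time. The following is a minimal, hypothetical PyTorch sketch of that idea: during training the codebook is driven to reproduce the M-LLM's caption embeddings from visual features alone, so the M-LLM can be dropped at test time. All module names, dimensions, the attention-based lookup, and the MSE distillation objective are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch: a learnable codebook standing in for M-LLM caption
# embeddings at inference. Shapes, the lookup scheme, and the loss are
# assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodebookCaptioner(nn.Module):
    """Approximates caption embeddings from pooled visual features."""

    def __init__(self, num_codes: int = 256, dim: int = 512):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim) * 0.02)
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, visual_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, dim), e.g. a pooled output of the MIM encoder.
        q = self.query_proj(visual_feat)                                  # (B, dim)
        attn = F.softmax(q @ self.codebook.t() / q.shape[-1] ** 0.5, -1)  # (B, num_codes)
        return attn @ self.codebook                                       # (B, dim)


# Training-time distillation: random tensors stand in for real features and
# for the caption embeddings produced by the M-LLM.
model = CodebookCaptioner()
visual_feat = torch.randn(4, 512)        # pooled visual features (assumed)
mllm_caption_emb = torch.randn(4, 512)   # M-LLM caption embeddings (training only)

pseudo_emb = model(visual_feat)
loss = F.mse_loss(pseudo_emb, mllm_caption_emb)  # match the M-LLM's output
loss.backward()

# At inference the M-LLM is removed entirely; only the lightweight codebook
# module runs, which is consistent with the reported near end-to-end speed.
```

Under this reading, the codebook acts as a compact distilled proxy for the language model's semantic output, which is why removing the M-LLM at inference can preserve accuracy while reaching speeds comparable to end-to-end models.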

