Wang Hao, Jia Tong, Wang Qilong, Zuo Wangmeng
IEEE Trans Image Process. 2025;34:4052-4066. doi: 10.1109/TIP.2025.3581729.
Building an effective object detector usually depends on a large set of well-annotated training samples. Annotating such a dataset is extremely laborious and costly, because it requires box-level supervision that includes both an accurate class label and localization coordinates. Compared with box-level annotation, weakly supervised forms of annotation (e.g., category, point, and scribble) are less laborious to collect and offer a feasible way to reduce the reliance on fully annotated datasets. However, because they lack sufficient supervisory information, current weakly supervised methods cannot achieve satisfactory detection performance. Recently, the Segment Anything Model (SAM) has emerged as a task-agnostic foundation model and has shown promising performance improvements in many related works, owing to its powerful generalization and data processing abilities. These properties inspire us to adopt such a foundation model for weakly supervised object detection to compensate for the deficiencies in supervisory information. However, directly deploying SAM on the weakly supervised object detection task raises two issues. First, SAM needs meticulously designed prompts, and such expert-level prompts restrict its applicability and practicality. Second, SAM is category-unaware: it cannot assign category labels to the predictions it generates. To solve these issues, we propose WS-SAM, which generalizes SAM to weakly supervised object detection with category labels. Specifically, we design an adaptive prompt generator to take full advantage of the spatial and semantic information from the prompt. It operates in a self-prompting manner, taking the output of SAM from the previous iteration as the prompt input to guide the next iteration, so that the prompts can be adaptively generated based on the class activation map.
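To make the self-prompting idea concrete, the following is a minimal sketch and not the WS-SAM implementation. It assumes that point prompts are taken as the strongest peaks of a 2-D activation map and that each round re-samples prompts inside the mask returned by the previous round. The names `cam_to_point_prompts`, `self_prompt`, and `segment_fn` (a stand-in for a promptable segmenter such as SAM) are hypothetical, as are the threshold and spacing parameters.

```python
import numpy as np


def cam_to_point_prompts(cam, rel_threshold=0.5, max_points=3, min_dist=2):
    """Pick up to max_points peak locations (x, y) from a 2-D activation map.

    Assumption: prompts are the highest activations above a relative
    threshold, kept at least min_dist pixels apart (Chebyshev distance).
    """
    cam = np.asarray(cam, dtype=float)
    span = cam.max() - cam.min()
    if span == 0:
        return []  # a flat map carries no localization cue
    norm = (cam - cam.min()) / span
    # Visit pixels from the strongest activation to the weakest.
    order = np.argsort(norm, axis=None)[::-1]
    points = []
    for flat in order:
        y, x = np.unravel_index(flat, norm.shape)
        if norm[y, x] < rel_threshold or len(points) == max_points:
            break
        # Greedy suppression: skip peaks too close to one already chosen.
        if all(max(abs(x - px), abs(y - py)) >= min_dist for px, py in points):
            points.append((int(x), int(y)))
    return points


def self_prompt(cam, segment_fn, n_iters=2):
    """Alternate prompt generation and segmentation for n_iters rounds.

    segment_fn is a hypothetical callable mapping a list of (x, y) points
    to a boolean mask with the same shape as cam.
    """
    cam = np.asarray(cam, dtype=float)
    mask = np.ones(cam.shape, dtype=bool)
    points = []
    for _ in range(n_iters):
        # Only activations inside the previous mask can produce new prompts.
        points = cam_to_point_prompts(np.where(mask, cam, cam.min()))
        if not points:
            break
        mask = segment_fn(points)
    return points, mask
```

In practice `segment_fn` would wrap a real promptable segmenter; here it is left abstract so that the loop structure, rather than any particular model API, is what the sketch illustrates.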
We also develop a segmentation mask refinement module and formulate the label assignment process as a shortest-path optimization problem by considering the similarity between each location and the prompts. Furthermore, a bidirectional adapter is implemented to resolve the domain discrepancy by incorporating domain-specific information. We evaluate our method on several detection datasets (e.g., PASCAL VOC and MS COCO), and the experimental results show that it achieves clear improvements over existing methods and performs favorably against state-of-the-art approaches.
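One way to read label assignment as a shortest-path problem is sketched below. This is an illustrative assumption, not the paper's formulation: each pixel is a graph node, the cost of stepping between 4-connected neighbors is the Euclidean distance between their feature vectors (a low cost means high similarity), and every pixel inherits the label of the prompt it can reach at the lowest total cost. The function name `assign_labels` and the input layout are hypothetical.

```python
import heapq

import numpy as np


def assign_labels(features, prompts):
    """Label each pixel with the prompt reachable at the lowest path cost.

    features: (H, W) or (H, W, C) array of per-pixel features.
    prompts:  list of ((x, y), label) pairs with integer labels.
    Returns (labels, dist), both of shape (H, W).
    """
    feats = np.asarray(features, dtype=float)
    if feats.ndim == 2:
        feats = feats[..., None]
    h, w = feats.shape[:2]
    dist = np.full((h, w), np.inf)
    labels = np.full((h, w), -1, dtype=int)
    heap = []
    # Multi-source Dijkstra: every prompt starts with zero cost.
    for (x, y), label in prompts:
        dist[y, x] = 0.0
        labels[y, x] = label
        heapq.heappush(heap, (0.0, y, x, label))
    while heap:
        d, y, x, label = heapq.heappop(heap)
        if d > dist[y, x]:
            continue  # stale entry; a cheaper path was already found
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                # Edge cost: feature dissimilarity between the two pixels.
                step = float(np.linalg.norm(feats[ny, nx] - feats[y, x]))
                nd = d + step
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    labels[ny, nx] = label
                    heapq.heappush(heap, (nd, ny, nx, label))
    return labels, dist
```

For example, on a one-row feature map `[[0, 0, 0, 10, 10, 10]]` with prompts `((0, 0), 1)` and `((5, 0), 2)`, the left three pixels take label 1 and the right three take label 2, because crossing the feature boundary costs more than staying within a homogeneous region.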