IEEE Trans Pattern Anal Mach Intell. 2022 Jan;44(1):456-473. doi: 10.1109/TPAMI.2020.3009758. Epub 2021 Dec 7.
The traditional object (person) retrieval (re-identification) task aims to learn a discriminative feature representation with intra-class similarity and inter-class dissimilarity, under the assumption that the objects in an image have been accurately pre-cropped, either manually or automatically. However, in many real-world search scenarios (e.g., video surveillance), the objects (e.g., persons, vehicles) are seldom accurately detected or annotated. Object-level retrieval therefore becomes intractable without bounding-box annotation, which leads to a new and challenging topic: image-level search that integrates detection and retrieval as joint tasks. In this paper, to address the image search problem, we first introduce an end-to-end Integrated Net (I-Net), which has three merits: 1) A Siamese architecture and an on-line pairing strategy for similar and dissimilar objects in the given images are designed. Benefiting from the Siamese structure, I-Net learns a shared feature representation on which both the object detection and classification tasks are handled. 2) A novel on-line pairing (OLP) loss with a dynamic feature dictionary is introduced, which alleviates the multi-task training stagnation problem by automatically generating a number of negative pairs to constrain the positives. 3) A hard example priority (HEP) based softmax loss is proposed to improve the robustness of the classification task by selecting hard categories. The shared feature representation of I-Net may, however, restrict the task-specific flexibility and learning capability of the detection and retrieval tasks. Therefore, following a divide-and-conquer philosophy, we further propose an improved I-Net, called DC-I-Net, which makes two new contributions: 1) Two modules are tailored to handle the different tasks separately within the integrated framework, so that task specificity is preserved. 2) A class-center guided HEP loss (C2HEP) that exploits the stored class centers is proposed, so that intra-class similarity and inter-class dissimilarity can be captured for the final retrieval. Extensive experiments on widely used image-level search benchmarks, namely the CUHK-SYSU and PRW datasets for person search and the large-scale WebTattoo dataset for tattoo search, demonstrate that the proposed DC-I-Net outperforms state-of-the-art task-integrated and task-separated image search models.
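The abstract does not give the formula for the HEP softmax loss, so the following is only a minimal sketch of one plausible reading: "selecting hard categories" is assumed here to mean restricting the softmax to the ground-truth class plus the highest-scoring competing classes. The function name `hep_softmax_loss` and the parameter `num_hard` are hypothetical and are not taken from the paper.

```python
import math


def hep_softmax_loss(logits, target, num_hard):
    """Cross-entropy over the target class plus the `num_hard` hardest negatives.

    Assumption (not from the paper): a "hard" category is a non-target class
    with a high score, i.e. one the classifier is likely to confuse with the target.
    """
    if not 0 <= target < len(logits):
        raise ValueError("target index out of range")

    # Rank non-target classes from highest to lowest score.
    negatives = sorted(
        (i for i in range(len(logits)) if i != target),
        key=lambda i: logits[i],
        reverse=True,
    )
    selected = [target] + negatives[:num_hard]

    # Numerically stable log-sum-exp over the selected classes only.
    peak = max(logits[i] for i in selected)
    log_norm = peak + math.log(sum(math.exp(logits[i] - peak) for i in selected))

    # Negative log-probability of the target within the reduced softmax.
    return log_norm - logits[target], selected


if __name__ == "__main__":
    loss, classes = hep_softmax_loss([2.0, 0.5, 1.5, -1.0], target=0, num_hard=1)
    print(classes, loss)
```

With `num_hard` equal to the number of non-target classes, the sketch reduces to the ordinary softmax cross-entropy; smaller values drop the easy negatives and leave only the most confusable categories in the normalization.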