IEEE Trans Pattern Anal Mach Intell. 2022 Nov;44(11):8306-8320. doi: 10.1109/TPAMI.2021.3113270. Epub 2022 Oct 4.
Deep metric learning (DML) is a cornerstone of many computer vision applications. It aims to learn a mapping from the input domain to an embedding space in which semantically similar objects lie close together and dissimilar objects far apart. The target similarity on the training data is defined by the user in the form of ground-truth class labels. However, while the embedding space learns to mimic the user-provided similarity on the training data, it should also generalize to novel categories not seen during training. Besides the user-provided ground-truth training labels, many additional visual factors (such as viewpoint changes or shape peculiarities) exist and imply different notions of similarity between objects, affecting generalization to images unseen during training. Existing approaches, however, usually learn a single embedding space directly on all available training data; such a space struggles to encode all the different types of relationships and does not generalize well. We propose to build a more expressive representation by jointly splitting the embedding space and the data hierarchically into smaller sub-parts. We successively focus on smaller subsets of the training data, reducing their variance and learning a different embedding subspace for each data subset. Moreover, the subspaces are learned jointly to cover not only the intricacies but also the breadth of the data. Only then, in the conquering stage, do we build the final embedding from the subspaces. The proposed algorithm acts as a transparent wrapper that can be placed around arbitrary existing DML methods. Our approach significantly improves upon the state of the art on image retrieval, clustering, and re-identification tasks, evaluated on the CUB200-2011, CARS196, Stanford Online Products, In-shop Clothes, and PKU VehicleID datasets.
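The divide-and-conquer structure described in the abstract can be illustrated with a minimal NumPy sketch. All function names here are hypothetical, and the actual method learns the data partition and the subspaces jointly with a deep network; this sketch only mirrors the high-level steps: a simple k-means split of the training data into subsets (divide), a slice of the embedding dimensions assigned per subset, and a final embedding built by concatenating and re-normalizing the subspaces (conquer).

```python
import numpy as np

def split_embedding(dim, k):
    # Divide stage (embedding side): partition the embedding
    # dimensions into k contiguous, near-equal subspaces.
    sizes = [dim // k + (1 if i < dim % k else 0) for i in range(k)]
    slices, start = [], 0
    for s in sizes:
        slices.append(slice(start, start + s))
        start += s
    return slices

def assign_clusters(features, k, iters=10, seed=0):
    # Divide stage (data side): a plain k-means split of the
    # training features into k subsets (stand-in for the learned
    # clustering used by the actual method).
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = features[labels == c].mean(axis=0)
    return labels

def final_embedding(sub_embeddings):
    # Conquer stage: concatenate the per-subset subspace embeddings
    # and L2-normalize to obtain the final embedding.
    e = np.concatenate(sub_embeddings, axis=-1)
    return e / np.linalg.norm(e, axis=-1, keepdims=True)
```

In the paper's setting each subspace would be trained with an arbitrary DML loss on its data subset, which is what makes the scheme a transparent wrapper around existing methods.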