IEEE Trans Image Process. 2022;31:6097-6108. doi: 10.1109/TIP.2022.3205216. Epub 2022 Sep 22.
Text-based person search aims at retrieving the target person in an image gallery using a descriptive sentence about that person. The core of this task is to calculate a similarity score between the pedestrian image and the description, which requires inferring the complex latent correspondence between image sub-regions and textual phrases at different scales. The Transformer is an intuitive choice for modeling this complex alignment via its self-attention mechanism. Most previous Transformer-based methods simply concatenate image region features and text features as input and learn a cross-modal representation in a brute-force manner. Such weakly supervised learning approaches fail to explicitly build alignment between image region features and text features, resulting in an inferior feature distribution. In this paper, we present CFLT, a Conditional Feature Learning based Transformer. It maps the sub-regions and phrases into a unified latent space and explicitly aligns them by constructing conditional embeddings, in which the feature of data from one modality is dynamically adjusted based on the data from the other modality. The output of our CFLT is a set of similarity scores for each sub-region or phrase rather than a cross-modal representation. Furthermore, we propose a simple and effective multi-modal re-ranking method named Re-ranking scheme by Visual Conditional Feature (RVCF). Benefiting from the visual conditional feature and the better feature distribution in our CFLT, the proposed RVCF achieves a significant performance improvement. Experimental results show that our CFLT outperforms the state-of-the-art methods by 7.03% in terms of top-1 accuracy and 5.01% in terms of top-5 accuracy on the text-based person search dataset.
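To make the idea of conditional embeddings concrete, the following is a minimal sketch of phrase-conditioned adjustment of visual region features followed by per-pair similarity scoring. The FiLM-style affine modulation (learned scale and shift projected from the phrase) is an assumption chosen for illustration, not the paper's exact formulation; all names (`conditional_scores`, `W_gamma`, `W_beta`) are hypothetical.

```python
import numpy as np

def conditional_scores(regions, phrases, W_gamma, W_beta):
    """Compute a similarity score for every (sub-region, phrase) pair.

    Each region feature is dynamically adjusted conditioned on the phrase
    (an affine modulation here, as an illustrative assumption), then scored
    against that phrase with cosine similarity, so the output is a matrix
    of per-pair scores rather than a single fused cross-modal embedding.
    """
    scores = np.zeros((regions.shape[0], phrases.shape[0]))
    for j, p in enumerate(phrases):
        gamma = p @ W_gamma          # per-phrase scale for region features
        beta = p @ W_beta            # per-phrase shift for region features
        for i, r in enumerate(regions):
            r_cond = gamma * r + beta  # region feature conditioned on phrase
            denom = np.linalg.norm(r_cond) * np.linalg.norm(p) + 1e-8
            scores[i, j] = (r_cond @ p) / denom
    return scores

rng = np.random.default_rng(0)
d = 8
regions = rng.normal(size=(3, d))          # 3 image sub-region features
phrases = rng.normal(size=(2, d))          # 2 textual phrase features
W_gamma = rng.normal(size=(d, d)) * 0.1    # hypothetical learned projections
W_beta = rng.normal(size=(d, d)) * 0.1
S = conditional_scores(regions, phrases, W_gamma, W_beta)
print(S.shape)  # one score per sub-region/phrase pair: (3, 2)
```

An image-level retrieval score could then be obtained by aggregating this matrix (e.g., averaging the maximum score per phrase), which is also the kind of per-pair evidence a re-ranking scheme such as RVCF can reuse.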