

Improving Description-based Person Re-identification by Multi-granularity Image-text Alignments.

Authors

Niu Kai, Huang Yan, Ouyang Wanli, Wang Liang

Publication

IEEE Trans Image Process. 2020 Apr 7. doi: 10.1109/TIP.2020.2984883.

Abstract

Description-based person re-identification (Re-id) is an important task in video surveillance that requires discriminative cross-modal representations to distinguish different people. It is difficult to directly measure the similarity between images and descriptions due to modality heterogeneity (the cross-modal problem), and the fact that all samples belong to a single category, i.e., person (the fine-grained problem), makes this task even harder than the conventional image-description matching task. In this paper, we propose a Multi-granularity Image-text Alignments (MIA) model to alleviate the cross-modal fine-grained problem for better similarity evaluation in description-based person Re-id. Specifically, three different granularities, i.e., global-global, global-local and local-local alignments, are carried out hierarchically. Firstly, the global-global alignment in the Global Contrast (GC) module matches the global contexts of images and descriptions. Secondly, the global-local alignment in the Relation-guided Global-local Alignment (RGA) module employs the potential relations between local components and global contexts to highlight the distinguishable components while adaptively eliminating the uninvolved ones. Thirdly, for the local-local alignment, we match visual human parts with noun phrases in the Bi-directional Fine-grained Matching (BFM) module. The whole network combining multiple granularities can be trained end-to-end without complex preprocessing. To address the difficulty of training the combination of multiple granularities, an effective step training strategy is proposed to train these granularities step-by-step. Extensive experiments and analysis show that our method obtains state-of-the-art performance on the CUHK-PEDES dataset and outperforms previous methods by a significant margin.
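To make the three-granularity design described above concrete, the following is a minimal PyTorch-style sketch of how the global-global (GC), global-local (RGA) and local-local (BFM) similarities could be computed and fused. The function names, the attention and max-pooling choices, and the equal-weight fusion are illustrative assumptions based only on the abstract, not the authors' released implementation.

```python
# Hypothetical sketch of combining the three alignment granularities from the
# abstract (global-global, global-local, local-local). All design details here
# are assumptions; the paper's actual GC/RGA/BFM modules are more elaborate.
import torch
import torch.nn.functional as F

def global_global_similarity(img_global, txt_global):
    # GC-style idea: cosine similarity between whole-image and whole-description embeddings.
    return F.cosine_similarity(img_global, txt_global, dim=-1)            # (B,)

def global_local_similarity(img_parts, txt_global):
    # RGA-style idea (simplified): weight each visual part by its relation to the
    # textual global context, then compare the aggregated representation.
    attn = F.softmax(torch.einsum('bpd,bd->bp', img_parts, txt_global), dim=1)   # (B, P)
    img_attended = torch.einsum('bp,bpd->bd', attn, img_parts)                   # (B, D)
    return F.cosine_similarity(img_attended, txt_global, dim=-1)                 # (B,)

def local_local_similarity(img_parts, txt_phrases):
    # BFM-style idea (simplified, one direction only): match each noun-phrase
    # embedding to its most similar body-part embedding and average over phrases.
    sim = torch.einsum('bpd,bnd->bpn',
                       F.normalize(img_parts, dim=-1),
                       F.normalize(txt_phrases, dim=-1))                          # (B, P, N)
    return sim.max(dim=1).values.mean(dim=1)                                      # (B,)

def overall_similarity(img_global, img_parts, txt_global, txt_phrases,
                       weights=(1.0, 1.0, 1.0)):
    # Fuse the three granularities; equal weights are an assumption.
    s_gg = global_global_similarity(img_global, txt_global)
    s_gl = global_local_similarity(img_parts, txt_global)
    s_ll = local_local_similarity(img_parts, txt_phrases)
    return weights[0] * s_gg + weights[1] * s_gl + weights[2] * s_ll
```

For example, with a batch of 2 image-description pairs, 6 visual part embeddings, 4 noun-phrase embeddings and 256-dimensional features, the inputs would have shapes (2, 256), (2, 6, 256), (2, 256) and (2, 4, 256), and overall_similarity returns one fused score per pair. The step training strategy mentioned in the abstract would correspond to optimizing these granularities one after another rather than jointly from the start.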

