Liu Delong, Li Haiwen, Zhao Zhicheng, Dong Yuan
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China.
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China; Beijing Key Laboratory of Network System and Network Culture, Beijing, China.
Neural Netw. 2025 Apr;184:107028. doi: 10.1016/j.neunet.2024.107028. Epub 2024 Dec 16.
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to given textual descriptions. A primary challenge in this task is bridging the substantial representational gap between the visual and textual modalities. Prevailing methods map texts and images into a unified embedding space for matching, yet the intricate semantic correspondences between texts and images are still not effectively constructed. To address this issue, we propose a novel TIPR framework that builds fine-grained interactions and alignment between person images and their corresponding texts. Specifically, a visual-textual dual encoder is first constructed by fine-tuning the Contrastive Language-Image Pre-training (CLIP) model, to preliminarily align image and text features. Secondly, a Text-guided Image Restoration (TIR) auxiliary task is proposed to map abstract textual entities to specific image regions, improving the alignment between local textual and visual embeddings. In addition, a cross-modal triplet loss is presented to handle hard samples and further enhance the model's ability to discriminate minor differences. Moreover, a pruning-based text data augmentation approach is proposed to focus the model on essential elements in descriptions, preventing excessive attention to less significant information. Experimental results show that the proposed method outperforms state-of-the-art methods on three popular benchmark datasets, and the code will be made publicly available at https://github.com/Delong-liu-bupt/SEN.
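To make the cross-modal triplet loss concrete, the following is a minimal sketch of one common formulation: a batch-hard triplet loss over L2-normalized image and text embeddings, where the hardest non-matching sample in the batch serves as the negative for each direction. The function name, margin value, and batch-hard mining rule are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the authors' released code): a batch-hard
# cross-modal triplet loss over paired image/text embeddings.
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D) tensors; row i of each forms a matched pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = txt_emb @ img_emb.t()                    # (B, B) cosine similarities
    pos = sim.diag()                               # similarity of matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Hardest (most similar) non-matching image per text, and vice versa.
    hard_neg_img = sim.masked_fill(mask, float('-inf')).max(dim=1).values
    hard_neg_txt = sim.masked_fill(mask, float('-inf')).max(dim=0).values
    loss_t2i = F.relu(margin - pos + hard_neg_img).mean()
    loss_i2t = F.relu(margin - pos + hard_neg_txt).mean()
    return loss_t2i + loss_i2t
```

In this sketch, mining the hardest in-batch negative is what targets the "hard samples" the abstract mentions: the loss is dominated by near-identical pedestrians that the model currently confuses, pushing it to resolve minor appearance differences.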