Conditional Feature Learning Based Transformer for Text-Based Person Search.

Publication information

IEEE Trans Image Process. 2022;31:6097-6108. doi: 10.1109/TIP.2022.3205216. Epub 2022 Sep 22.

DOI:10.1109/TIP.2022.3205216
PMID:36103442
Abstract

Text-based person search aims at retrieving the target person from an image gallery using a descriptive sentence about that person. The core of this task is to calculate a similarity score between the pedestrian image and the description, which requires inferring the complex latent correspondence between image sub-regions and textual phrases at different scales. The Transformer is an intuitive way to model this complex alignment through its self-attention mechanism. Most previous Transformer-based methods simply concatenate image region features and text features as input and learn a cross-modal representation in a brute-force manner. Such weakly supervised learning approaches fail to explicitly build alignment between image region features and text features, resulting in an inferior feature distribution. In this paper, we present CFLT, a Conditional Feature Learning based Transformer. It maps the sub-regions and phrases into a unified latent space and explicitly aligns them by constructing conditional embeddings, in which the features of data from one modality are dynamically adjusted based on the data from the other modality. The output of our CFLT is a set of similarity scores for each sub-region or phrase rather than a cross-modal representation. Furthermore, we propose a simple and effective multi-modal re-ranking method named Re-ranking scheme by Visual Conditional Feature (RVCF). Benefiting from the visual conditional feature and the better feature distribution in our CFLT, the proposed RVCF achieves a significant performance improvement. Experimental results show that our CFLT outperforms the state-of-the-art methods by 7.03% in terms of top-1 accuracy and 5.01% in terms of top-5 accuracy on the text-based person search dataset.
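The idea of a conditional embedding, and the top-k accuracy used to report results, can be illustrated with a small NumPy sketch. This is not the CFLT architecture from the paper: the single attention step, the cosine scoring, the mean aggregation, and every function name below (`conditional_similarity`, `top_k_accuracy`) are assumptions chosen only to show how a visual feature might be adjusted per phrase and how one score per phrase might be produced instead of a single cross-modal vector.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def conditional_similarity(regions, phrases):
    """Illustrative only (hypothetical helper, not the paper's model).

    regions: (R, d) image sub-region features
    phrases: (P, d) textual phrase features
    Returns (per-phrase scores of shape (P,), their mean as an overall score).
    """
    regions = l2_normalize(regions)
    phrases = l2_normalize(phrases)
    d = regions.shape[1]
    # Each phrase attends over all sub-regions.
    attn = softmax(phrases @ regions.T / np.sqrt(d), axis=1)   # (P, R)
    # Visual feature conditioned on each phrase: an attention-weighted
    # mixture of sub-regions, so it changes with the text input.
    cond_visual = l2_normalize(attn @ regions)                 # (P, d)
    # One cosine similarity per phrase rather than one fused vector.
    per_phrase = np.sum(cond_visual * phrases, axis=1)         # (P,)
    return per_phrase, float(per_phrase.mean())

def top_k_accuracy(score_matrix, gt_index, k):
    """Fraction of queries whose correct gallery item is in the top k.

    Simplifying assumption: each query has exactly one correct gallery item.
    score_matrix: (Q, G) query-by-gallery similarities
    gt_index:     (Q,) index of the correct gallery item per query
    """
    topk = np.argsort(-score_matrix, axis=1)[:, :k]
    hits = (topk == np.asarray(gt_index)[:, None]).any(axis=1)
    return float(hits.mean())
```

In this sketch a phrase that matches one of the sub-regions scores higher than the same phrase against unrelated sub-regions, because the attention weights pull the conditioned visual feature towards the matching region.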

Similar articles

1
Conditional Feature Learning Based Transformer for Text-Based Person Search.
IEEE Trans Image Process. 2022;31:6097-6108. doi: 10.1109/TIP.2022.3205216. Epub 2022 Sep 22.
2
Learning Feature Recovery Transformer for Occluded Person Re-Identification.
IEEE Trans Image Process. 2022;31:4651-4662. doi: 10.1109/TIP.2022.3186759. Epub 2022 Jul 12.
3
Learning Aligned Image-Text Representations Using Graph Attentive Relational Network.
IEEE Trans Image Process. 2021;30:1840-1852. doi: 10.1109/TIP.2020.3048627. Epub 2021 Jan 18.
4
Structure-Aware Positional Transformer for Visible-Infrared Person Re-Identification.
IEEE Trans Image Process. 2022;31:2352-2364. doi: 10.1109/TIP.2022.3141868. Epub 2022 Mar 15.
5
A Multi-Level Relation-Aware Transformer model for occluded person re-identification.
Neural Netw. 2024 Sep;177:106382. doi: 10.1016/j.neunet.2024.106382. Epub 2024 May 9.
6
Conditional Feature Embedding by Visual Clue Correspondence Graph for Person Re-Identification.
IEEE Trans Image Process. 2022;31:6188-6199. doi: 10.1109/TIP.2022.3206617. Epub 2022 Sep 28.
7
Deep Relation Embedding for Cross-Modal Retrieval.
IEEE Trans Image Process. 2021;30:617-627. doi: 10.1109/TIP.2020.3038354. Epub 2020 Dec 1.
8
Image-Specific Information Suppression and Implicit Local Alignment for Text-Based Person Search.
IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17973-17986. doi: 10.1109/TNNLS.2023.3310118. Epub 2024 Dec 2.
9
SparseMorph: A weakly-supervised lightweight sparse transformer for mono- and multi-modal deformable image registration.
Comput Biol Med. 2024 Nov;182:109205. doi: 10.1016/j.compbiomed.2024.109205. Epub 2024 Sep 26.
10
Shared-Specific Feature Learning With Bottleneck Fusion Transformer for Multi-Modal Whole Slide Image Analysis.
IEEE Trans Med Imaging. 2023 Nov;42(11):3374-3383. doi: 10.1109/TMI.2023.3287256. Epub 2023 Oct 27.

Cited by

1
User recommendation method integrating hierarchical graph attention network with multimodal knowledge graph.
Front Neurorobot. 2025 Jun 18;19:1587973. doi: 10.3389/fnbot.2025.1587973. eCollection 2025.