Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval.

Authors

Liu Delong, Li Haiwen, Zhao Zhicheng, Dong Yuan

Affiliations

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China.

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China; Beijing Key Laboratory of Network System and Network Culture, Beijing, China.

Publication information

Neural Netw. 2025 Apr;184:107028. doi: 10.1016/j.neunet.2024.107028. Epub 2024 Dec 16.

DOI:10.1016/j.neunet.2024.107028
PMID:39700822
Abstract

The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to given textual descriptions. A primary challenge in this task is bridging the substantial representational gap between the visual and textual modalities. Prevailing methods map texts and images into a unified embedding space for matching, but the intricate semantic correspondences between texts and images are still not effectively constructed. To address this issue, we propose a novel TIPR framework that builds fine-grained interactions and alignment between person images and their corresponding texts. Specifically, a visual-textual dual encoder is first constructed by fine-tuning the Contrastive Language-Image Pre-training (CLIP) model, to preliminarily align the image and text features. Second, a Text-guided Image Restoration (TIR) auxiliary task is proposed to map abstract textual entities to specific image regions, improving the alignment between local textual and visual embeddings. In addition, a cross-modal triplet loss is presented to handle hard samples and further enhance the model's ability to discriminate minor differences. Moreover, a pruning-based text data augmentation approach is proposed to strengthen the focus on essential elements in descriptions, thereby avoiding excessive model attention to less significant information. Experimental results show that the proposed method outperforms state-of-the-art methods on three popular benchmark datasets. The code will be made publicly available at https://github.com/Delong-liu-bupt/SEN.
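
The abstract names a cross-modal triplet loss for hard samples but does not give its formula. As a rough illustration only, the sketch below shows one common form of such a loss: a bidirectional hinge loss over cosine similarities that uses the hardest in-batch negative in each direction. The function names, the margin value of 0.2, and the assumption that index i in both lists refers to the same person (with every other index a different person) are all hypothetical choices for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_modal_triplet_loss(image_embs, text_embs, margin=0.2):
    """Bidirectional hinge triplet loss with hardest in-batch negatives.

    Assumption: image_embs[i] and text_embs[i] describe the same person,
    and all other pairs in the batch are negatives. Needs a batch of >= 2.
    """
    n = len(image_embs)
    # sim[i][j] = similarity between image i and text j
    sim = [[cosine(img, txt) for txt in text_embs] for img in image_embs]
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative text for image i (image-to-text direction)
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative image for text i (text-to-image direction)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total / n


# Toy batch: two well-separated identities
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = cross_modal_triplet_loss(images, aligned_texts)
swapped_loss = cross_modal_triplet_loss(images, swapped_texts)
```

In this toy batch the correctly paired embeddings incur no penalty, while swapping the texts makes each negative more similar than its positive and produces a large loss. A real implementation would operate on batched tensors from the fine-tuned dual encoder and would need to treat multiple images of the same identity as positives rather than negatives.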

Similar articles

1
Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval.
Neural Netw. 2025 Apr;184:107028. doi: 10.1016/j.neunet.2024.107028. Epub 2024 Dec 16.
2
Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training.
IEEE Trans Image Process. 2023;32:3622-3633. doi: 10.1109/TIP.2023.3286710. Epub 2023 Jul 3.
3
Multi-grained visual pivot-guided multi-modal neural machine translation with text-aware cross-modal contrastive disentangling.
Neural Netw. 2024 Oct;178:106403. doi: 10.1016/j.neunet.2024.106403. Epub 2024 May 23.
4
Unsupervised Visual-Textual Correlation Learning With Fine-Grained Semantic Alignment.
IEEE Trans Cybern. 2022 May;52(5):3669-3683. doi: 10.1109/TCYB.2020.3015084. Epub 2022 May 19.
5
Histopathology language-image representation learning for fine-grained digital pathology cross-modal retrieval.
Med Image Anal. 2024 Jul;95:103163. doi: 10.1016/j.media.2024.103163. Epub 2024 Apr 9.
6
Visual context learning based on textual knowledge for image-text retrieval.
Neural Netw. 2022 Aug;152:434-449. doi: 10.1016/j.neunet.2022.05.008. Epub 2022 May 18.
7
Boosting cross-modal retrieval in remote sensing via a novel unified attention network.
Neural Netw. 2024 Dec;180:106718. doi: 10.1016/j.neunet.2024.106718. Epub 2024 Sep 11.
8
VGSG: Vision-Guided Semantic-Group Network for Text-Based Person Search.
IEEE Trans Image Process. 2024;33:163-176. doi: 10.1109/TIP.2023.3337653. Epub 2023 Dec 8.
9
Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective.
Sensors (Basel). 2024 May 14;24(10):3130. doi: 10.3390/s24103130.
10
Novel cross-dimensional coarse-fine-grained complementary network for image-text matching.
PeerJ Comput Sci. 2025 Mar 3;11:e2725. doi: 10.7717/peerj-cs.2725. eCollection 2025.