A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint.

Affiliations

Intelligent Computer Vision Software Laboratory (ICVSLab), Department of Electronic Engineering, Yeungnam University, 280 Daehak-Ro, Gyeongsan 38541, Gyeongbuk, Korea.

Department of Electrical Engineering, Pohang University of Science and Technology, Pohang 37673, Korea.

Publication Information

Sensors (Basel). 2022 Sep 8;22(18):6816. doi: 10.3390/s22186816.

DOI: 10.3390/s22186816
PMID: 36146161
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9503702/
Abstract

For decades, co-relating different data domains to attain the maximum potential of machines has driven research, especially in neural networks. Similarly, text and visual data (images and videos) are two distinct data domains with extensive research in the past. Recently, using natural language to process 2D or 3D images and videos with the immense power of neural nets has witnessed a promising future. Despite the diverse range of remarkable work in this field, notably in the past few years, rapid improvements have also solved future challenges for researchers. Moreover, the connection between these two domains is mainly subjected to GAN, thus limiting the horizons of this field. This review analyzes Text-to-Image (T2I) synthesis as a broader picture, Text-guided Visual-output (T2Vo), with the primary goal being to highlight the gaps by proposing a more comprehensive taxonomy. We broadly categorize text-guided visual output into three main divisions and meaningful subdivisions by critically examining an extensive body of literature from top-tier computer vision venues and closely related fields, such as machine learning and human-computer interaction, aiming at state-of-the-art models with a comparative analysis. This study successively follows previous surveys on T2I, adding value by analogously evaluating the diverse range of existing methods, including different generative models, several types of visual output, critical examination of various approaches, and highlighting the shortcomings, suggesting the future direction of research.
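The abstract's central observation is that text-guided visual output has so far been dominated by GAN-based text-to-image synthesis. As a purely illustrative aid (not code from the paper), the sketch below shows the skeleton of a text-conditioned GAN for 64x64 image synthesis in PyTorch; all module names, layer sizes, and the assumption of a pre-computed sentence embedding are hypothetical choices for illustration.

```python
# Illustrative sketch only: a minimal text-conditioned GAN of the kind this
# review surveys for text-to-image (T2I) synthesis. Dimensions and module
# names are assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    def __init__(self, noise_dim=100, text_dim=256, img_channels=3):
        super().__init__()
        # Project the sentence embedding to a compact conditioning vector.
        self.text_proj = nn.Sequential(nn.Linear(text_dim, 128), nn.ReLU())
        # Upsample [noise ; text condition] to a 64x64 image.
        self.net = nn.Sequential(
            nn.ConvTranspose2d(noise_dim + 128, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(),
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, noise, text_emb):
        cond = self.text_proj(text_emb)                       # (B, 128)
        z = torch.cat([noise, cond], dim=1)[..., None, None]  # (B, noise+128, 1, 1)
        return self.net(z)                                    # (B, 3, 64, 64)

class TextConditionedDiscriminator(nn.Module):
    def __init__(self, text_dim=256, img_channels=3):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(text_dim, 128), nn.ReLU())
        # Downsample the image, then fuse with the text condition for a real/fake score.
        self.conv = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
        )
        self.head = nn.Conv2d(512 + 128, 1, 4, 1, 0)

    def forward(self, img, text_emb):
        feat = self.conv(img)                                             # (B, 512, 4, 4)
        cond = self.text_proj(text_emb)[..., None, None].expand(-1, -1, 4, 4)
        return self.head(torch.cat([feat, cond], dim=1)).view(-1)        # (B,)

if __name__ == "__main__":
    G, D = TextConditionedGenerator(), TextConditionedDiscriminator()
    noise = torch.randn(2, 100)
    text = torch.randn(2, 256)   # stand-in for a pretrained sentence embedding
    fake = G(noise, text)        # (2, 3, 64, 64)
    score = D(fake, text)        # (2,)
    print(fake.shape, score.shape)
```

In the surveyed T2I systems, the sentence embedding would typically come from a pretrained text encoder (an RNN or Transformer), and training alternates generator and discriminator updates on a real/fake objective, often augmented with an image-text matching term.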

Figures (PMC image links, g001-g011):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/6d821a669e38/sensors-22-06816-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/d64821b0b0fc/sensors-22-06816-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/c2bc5703ef0f/sensors-22-06816-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/7428614b3aca/sensors-22-06816-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/127c4016dcd5/sensors-22-06816-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/075363c6b810/sensors-22-06816-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/0c08ac3690d8/sensors-22-06816-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/563decb79db2/sensors-22-06816-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/7aa2fbb61486/sensors-22-06816-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/e5c20482e5b5/sensors-22-06816-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8b0/9503702/572af843929e/sensors-22-06816-g011.jpg

Similar Articles

1. A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint. Sensors (Basel). 2022 Sep 8;22(18):6816. doi: 10.3390/s22186816.
2. Multi-grained visual pivot-guided multi-modal neural machine translation with text-aware cross-modal contrastive disentangling. Neural Netw. 2024 Oct;178:106403. doi: 10.1016/j.neunet.2024.106403. Epub 2024 May 23.
3. Adversarial text-to-image synthesis: A review. Neural Netw. 2021 Dec;144:187-209. doi: 10.1016/j.neunet.2021.07.019. Epub 2021 Aug 8.
4. SAM-GAN: Self-Attention supporting Multi-stage Generative Adversarial Networks for text-to-image synthesis. Neural Netw. 2021 Jun;138:57-67. doi: 10.1016/j.neunet.2021.01.023. Epub 2021 Feb 10.
5. Bone shadow segmentation from ultrasound data for orthopedic surgery using GAN. Int J Comput Assist Radiol Surg. 2020 Sep;15(9):1477-1485. doi: 10.1007/s11548-020-02221-z. Epub 2020 Jul 11.
6. Accuracy of Using Generative Adversarial Networks for Glaucoma Detection: Systematic Review and Bibliometric Analysis. J Med Internet Res. 2021 Sep 21;23(9):e27414. doi: 10.2196/27414.
7. From Show to Tell: A Survey on Deep Learning-Based Image Captioning. IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):539-559. doi: 10.1109/TPAMI.2022.3148210. Epub 2022 Dec 5.
8. Latent Dirichlet allocation based generative adversarial networks. Neural Netw. 2020 Dec;132:461-476. doi: 10.1016/j.neunet.2020.08.012. Epub 2020 Sep 21.
9. Cross-Modal Search for Social Networks via Adversarial Learning. Comput Intell Neurosci. 2020 Jul 11;2020:7834953. doi: 10.1155/2020/7834953. eCollection 2020.
10. A Comprehensive Survey on Graph Neural Networks. IEEE Trans Neural Netw Learn Syst. 2021 Jan;32(1):4-24. doi: 10.1109/TNNLS.2020.2978386. Epub 2021 Jan 4.
