Intelligent Computer Vision Software Laboratory (ICVSLab), Department of Electronic Engineering, Yeungnam University, 280 Daehak-Ro, Gyeongsan 38541, Gyeongbuk, Korea.
Department of Electrical Engineering, Pohang University of Science and Technology, Pohang 37673, Korea.
Sensors (Basel). 2022 Sep 8;22(18):6816. doi: 10.3390/s22186816.
For decades, correlating different data domains to realize the full potential of machines has driven research, especially in neural networks. Text and visual data (images and videos) are two such distinct domains, each with an extensive research history. Recently, using natural language to process 2D or 3D images and videos with the power of neural networks has shown great promise. Despite a diverse range of remarkable work in this field, particularly in the past few years, rapid improvements have also raised new challenges for researchers. Moreover, the connection between these two domains has relied mainly on GANs, limiting the horizons of the field. This review analyzes Text-to-Image (T2I) synthesis within a broader picture, Text-guided Visual output (T2Vo), with the primary goal of highlighting research gaps through a more comprehensive taxonomy. We categorize text-guided visual output into three main divisions with meaningful subdivisions by critically examining an extensive body of literature from top-tier computer vision venues and closely related fields, such as machine learning and human-computer interaction, and provide a comparative analysis of state-of-the-art models. This study builds on previous surveys of T2I, adding value by evaluating the diverse range of existing methods, covering different generative models and several types of visual output, critically examining the various approaches, highlighting their shortcomings, and suggesting future directions of research.