IEEE Trans Pattern Anal Mach Intell. 2021 Nov;43(11):4037-4058. doi: 10.1109/TPAMI.2020.2992393. Epub 2021 Oct 1.
Training deep neural networks to learn visual features from images or videos for computer vision applications generally requires large-scale labeled data to achieve good performance. To avoid the high cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminology of this field are described. Then the common deep neural network architectures used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used datasets for images, videos, audio, and 3D data, as well as the existing self-supervised visual feature learning methods. Quantitative performance comparisons of the reviewed methods on benchmark datasets are then summarized and discussed for both image and video feature learning. Finally, the paper concludes with a set of promising future directions for self-supervised visual feature learning.