Schuster Viktoria, Krogh Anders
Center for Health Data Science, University of Copenhagen, 2200 Copenhagen, Denmark.
Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark.
Entropy (Basel). 2021 Oct 25;23(11):1403. doi: 10.3390/e23111403.
Autoencoders are commonly used in representation learning. They consist of an encoder and a decoder, which provide a straightforward method to map n-dimensional data in input space to a lower m-dimensional representation space and back. The decoder itself defines an m-dimensional manifold in input space. Inspired by manifold learning, we showed that the decoder can be trained on its own by learning the representations of the training samples along with the decoder weights using gradient descent. A sum-of-squares loss then corresponds to optimizing the manifold to have the smallest Euclidean distance to the training samples, and similarly for other loss functions. We derived expressions for the number of samples needed to specify the encoder and decoder and showed that the decoder generally requires far fewer training samples to be well-specified than the encoder. We discuss the training of autoencoders from this perspective and relate it to previous work in the field that uses noisy training examples and other types of regularization. On the natural image data sets MNIST and CIFAR10, we demonstrated that the decoder is much better suited to learning a low-dimensional representation, especially when trained on small data sets. Using simulated gene regulatory data, we further showed that the decoder alone leads to better generalization and meaningful representations. Our approach of training the decoder alone facilitates representation learning even on small data sets and can lead to improved training of autoencoders. We hope that the simple analyses presented will also contribute to an improved conceptual understanding of representation learning.
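As a concrete illustration of the decoder-only training described above, the sketch below treats each training sample's representation z_i as a free parameter and minimizes the sum-of-squares loss L(theta, {z_i}) = sum_i ||x_i - f_theta(z_i)||^2 jointly over the decoder weights theta and the representations by gradient descent. This is a minimal sketch of the idea, not the authors' implementation; the toy data, network architecture, and hyperparameters below are placeholders chosen for brevity.

```python
# Decoder-only training: the latent codes Z are free parameters
# optimized jointly with the decoder weights by gradient descent.
import torch
import torch.nn as nn

n_samples, input_dim, latent_dim = 1000, 784, 2

# Toy stand-in for a training set (e.g. flattened images).
X = torch.rand(n_samples, input_dim)

# Placeholder decoder architecture mapping latent space to input space.
decoder = nn.Sequential(
    nn.Linear(latent_dim, 128),
    nn.ReLU(),
    nn.Linear(128, input_dim),
)

# One learnable representation per training sample.
Z = nn.Parameter(torch.randn(n_samples, latent_dim) * 0.01)

optimizer = torch.optim.Adam(list(decoder.parameters()) + [Z], lr=1e-3)

for epoch in range(100):
    optimizer.zero_grad()
    reconstruction = decoder(Z)
    # Sum-of-squares loss: fits the decoder manifold to the data by
    # minimizing Euclidean distance to each training sample.
    loss = ((reconstruction - X) ** 2).sum(dim=1).mean()
    loss.backward()
    optimizer.step()
```

In this setup, a representation for a new sample could be obtained by the same latent optimization with the decoder weights frozen, i.e., finding the point on the decoder manifold closest to the sample.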