Eguchi Akihiro, Horii Takato, Nagai Takayuki, Kanai Ryota, Oizumi Masafumi
Basic Research Group, Araya Inc., Tokyo, Japan.
Department of Systems Innovation, Graduate School of Engineering Science, Osaka University, Osaka, Japan.
Front Comput Neurosci. 2020 Jan 29;14:1. doi: 10.3389/fncom.2020.00001. eCollection 2020.
Modality-invariant categorical representations, i.e., shared representations, are thought to play a key role in learning to categorize multi-modal information. We investigated how a bimodal autoencoder can form a shared representation from multi-modal data in an unsupervised manner. We explored whether altering the depth of the network and mixing the multi-modal inputs at the input layer affect the development of the shared representations. Based on the activation of units in the hidden layers, we classified the units into four types: visual cells, auditory cells, inconsistent visual and auditory cells, and consistent visual and auditory cells. Our results show that the number and quality of the last type (i.e., shared representations) differ significantly depending on the depth of the network and are enhanced when the network receives mixed inputs rather than separate inputs for each modality, as occurs in typical two-stage frameworks. In this work, we present a way to use information theory to understand the abstract representations formed in the hidden layers of the network. We believe that such an information-theoretic approach could provide insights into the development of more efficient and cost-effective ways to train neural networks, using qualitative measures of the representations that cannot be captured by analyzing only the final outputs of the networks.
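The abstract does not spell out the criteria used to sort hidden units into the four types, so the following is only a minimal sketch of one way such a classification could be set up with mutual information. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`mutual_information`, `preferred_category`, `classify_unit`), binarizing activations at a fixed `fire_threshold`, the `mi_threshold` cutoff, judging consistency by whether the preferred category matches across modalities, and the extra "uninformative" fallback label.

```python
import numpy as np


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete 1-D sequences."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    # Joint histogram of (x, y) pairs, normalized to a probability table.
    joint = np.zeros((xs.size, ys.size))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))


def preferred_category(act, labels):
    """Category whose stimuli drive the highest mean activation."""
    cats = np.unique(labels)
    means = [act[labels == c].mean() for c in cats]
    return cats[int(np.argmax(means))]


def classify_unit(act_vis, act_aud, labels, mi_threshold=0.1, fire_threshold=0.5):
    """Label one hidden unit from its responses to unimodal inputs.

    act_vis / act_aud: activations of the unit when only the visual or only
    the auditory input is presented; labels: category of each stimulus.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    act_vis = np.asarray(act_vis, dtype=float)
    act_aud = np.asarray(act_aud, dtype=float)
    labels = np.asarray(labels)
    # Assumed binarization: the unit "fires" if activation exceeds a threshold.
    mi_vis = mutual_information((act_vis > fire_threshold).astype(int), labels)
    mi_aud = mutual_information((act_aud > fire_threshold).astype(int), labels)
    vis_informative = mi_vis > mi_threshold
    aud_informative = mi_aud > mi_threshold
    if vis_informative and aud_informative:
        same = preferred_category(act_vis, labels) == preferred_category(act_aud, labels)
        if same:
            return "consistent visual and auditory"
        return "inconsistent visual and auditory"
    if vis_informative:
        return "visual"
    if aud_informative:
        return "auditory"
    return "uninformative"  # fallback label added for this sketch only


if __name__ == "__main__":
    labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    selective = np.array([0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1])
    flat = np.full(8, 0.1)
    print(classify_unit(selective, selective, labels))
    print(classify_unit(selective, selective[::-1], labels))
    print(classify_unit(selective, flat, labels))
```

Under these assumptions, a unit counts toward the shared representation only when it carries category information for both modalities and prefers the same category in each, which is one plausible reading of "consistent visual and auditory cells".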