Department of Linguistics, University of California, Berkeley, United States of America.
Neural Netw. 2021 Jul;139:305-325. doi: 10.1016/j.neunet.2021.03.017. Epub 2021 Mar 19.
How can deep neural networks encode information that corresponds to words in human speech into raw acoustic data? This paper proposes two neural network architectures for modeling unsupervised lexical learning from raw acoustic inputs: ciwGAN (Categorical InfoWaveGAN) and fiwGAN (Featural InfoWaveGAN). These combine the Deep Convolutional GAN architecture for audio data (WaveGAN; Donahue et al., 2019) with the information-theoretic extension of GANs, InfoGAN (Chen et al., 2016), and propose a new latent space structure that can model featural learning simultaneously with higher-level classification and allows for a very low-dimensional vector representation of lexical items. In addition to the Generator and Discriminator networks, the architectures introduce a network that learns to retrieve latent codes from generated audio outputs. Lexical learning is thus modeled as emergent from an architecture that forces a deep neural network to output data such that unique information is retrievable from its acoustic outputs. The networks trained on lexical items from the TIMIT corpus learn to encode unique information corresponding to lexical items in the form of categorical variables in their latent space. By manipulating these variables, the network outputs specific lexical items. The network occasionally outputs innovative lexical items that violate the training data but are linguistically interpretable and highly informative for cognitive modeling and neural network interpretability. Innovative outputs suggest that the phonetic and phonological representations learned by the network can be productively recombined and directly paralleled to productivity in human speech: a fiwGAN network trained on suit and dark outputs the innovative start, even though it never saw start or even a [st] sequence in the training data.
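As a rough illustration of the latent space structure described above (this is a sketch, not the authors' code; all dimensions are assumed for the example), the difference between the two architectures' codes can be shown as latent vectors: ciwGAN uses a one-hot categorical code, so n code dimensions distinguish n lexical items, while fiwGAN uses binary featural codes, so n bits can distinguish up to 2^n items, which is what enables the very low-dimensional lexical representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def ciwgan_latent(class_idx, n_classes=10, n_noise=90):
    """ciwGAN-style latent vector: one-hot categorical code + noise.
    n_classes one-hot dimensions distinguish n_classes lexical items."""
    code = np.zeros(n_classes)
    code[class_idx] = 1.0
    noise = rng.uniform(-1, 1, n_noise)
    return np.concatenate([code, noise])

def fiwgan_latent(feature_bits, n_noise=95):
    """fiwGAN-style latent vector: binary featural code + noise.
    n bits distinguish up to 2**n lexical items, so the code can
    stay much smaller than the lexicon."""
    code = np.asarray(feature_bits, dtype=float)
    noise = rng.uniform(-1, 1, n_noise)
    return np.concatenate([code, noise])

# One-hot: 10 code dimensions encode 10 lexical classes.
z_ciw = ciwgan_latent(class_idx=3)

# Featural: 5 binary dimensions encode up to 2**5 = 32 lexical items.
z_fiw = fiwgan_latent([1, 0, 1, 1, 0])
```

In the full architectures, the auxiliary network (the InfoGAN-style Q-network) is trained to recover `code` from the Generator's audio output, which is what forces the Generator to encode unique, retrievable information in each lexical item.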
We also argue that setting latent featural codes to values well beyond the training range results in almost categorical generation of prototypical lexical items and reveals the underlying value of each latent code. Probing deep neural networks trained on well-understood dependencies in speech has implications for latent space interpretability and for understanding how deep neural networks learn meaningful representations, as well as the potential for unsupervised text-to-speech generation in the GAN framework.
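The beyond-training-range manipulation can be sketched as follows (a minimal illustration under assumed values; the specific scale factors are hypothetical examples, not the paper's reported settings). During training the featural code bits take values 0 or 1; at generation time, multiplying active bits by a large constant is what the abstract describes as pushing outputs toward a prototypical lexical item for that code:

```python
import numpy as np

def scaled_code(feature_bits, scale):
    """Scale a binary featural code beyond its training range.
    Bits are 0/1 during training; a large scale at generation time
    probes which lexical item each code underlyingly encodes."""
    return np.asarray(feature_bits, dtype=float) * scale

code_train = scaled_code([1, 0, 1], scale=1)   # in-range, as seen in training
code_probe = scaled_code([1, 0, 1], scale=15)  # well beyond training range
```

Feeding `code_probe` (concatenated with noise) to the trained Generator in place of `code_train` is the probing technique the abstract describes: near-categorical generation of the item associated with that code.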