Keshishian Menoua, Norman-Haignere Sam V, Mesgarani Nima
Department of Electrical Engineering, Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY 10027.
Adv Neural Inf Process Syst. 2021 Dec;34:24455-24467.
Natural signals such as speech are hierarchically structured across many different timescales, spanning tens (e.g., phonemes) to hundreds (e.g., words) of milliseconds, each of which is highly variable and context-dependent. While deep neural networks (DNNs) excel at recognizing complex patterns from natural signals, relatively little is known about how DNNs flexibly integrate across multiple timescales. Here, we show how a recently developed method for studying temporal integration in biological neural systems - the temporal context invariance (TCI) paradigm - can be used to understand temporal integration in DNNs. The method is simple: we measure responses to a large number of stimulus segments presented in two different contexts and estimate the smallest segment duration needed to achieve a context-invariant response. We applied our method to understand how the popular DeepSpeech2 model learns to integrate across time in speech. We find that nearly all of the model units, even in recurrent layers, have a compact integration window within which stimuli substantially alter the response and outside of which stimuli have little effect. We show that training causes these integration windows to shrink at early layers and expand at higher layers, creating a hierarchy of integration windows across the network. Moreover, by measuring integration windows for time-stretched/compressed speech, we reveal a transition point, midway through the trained network, where integration windows become yoked to the duration of stimulus structures (e.g., phonemes or words) rather than absolute time. Similar phenomena were observed in a purely recurrent and a purely convolutional network, although structure-yoked integration was more prominent in the recurrent network. These findings suggest that deep speech recognition systems use a common motif to encode the hierarchical structure of speech: integrating across short, time-yoked windows at early layers and long, structure-yoked windows at later layers. Our method provides a straightforward and general-purpose toolkit for understanding temporal integration in black-box machine learning models.
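The abstract describes the core of the TCI procedure: the same speech segments are presented inside two different surrounding contexts, and the segment-aligned responses are compared across those contexts. The following is a minimal Python sketch of that logic, not the authors' implementation; the black-box function model_response (mapping a 1-D waveform to a (frames, units) response array), the list of equal-length segments, and the assumption that each segment maps to an equal number of response frames are all introduced here purely for illustration.

import numpy as np

def cross_context_correlation(model_response, segments, rng):
    """Present the same segments in two different random orders (two
    different surrounding contexts) and correlate the segment-aligned
    responses across the two contexts, one value per model unit."""
    n_seg = len(segments)
    order_a = rng.permutation(n_seg)          # context A: one random ordering
    order_b = rng.permutation(n_seg)          # context B: a different ordering
    seq_a = np.concatenate([segments[i] for i in order_a])
    seq_b = np.concatenate([segments[i] for i in order_b])

    resp_a = model_response(seq_a)            # (frames, units), hypothetical black-box model
    resp_b = model_response(seq_b)

    seg_frames = resp_a.shape[0] // n_seg     # assumes equal frame count per segment
    pos_a = np.argsort(order_a)               # pos_a[i] = position of segment i in seq A
    pos_b = np.argsort(order_b)

    # Re-align responses so entry i is the response to segment i in each context.
    aligned_a = np.stack([resp_a[pos_a[i] * seg_frames:(pos_a[i] + 1) * seg_frames]
                          for i in range(n_seg)])
    aligned_b = np.stack([resp_b[pos_b[i] * seg_frames:(pos_b[i] + 1) * seg_frames]
                          for i in range(n_seg)])

    # High correlation means a unit's response to a segment is invariant to
    # the context surrounding it.
    n_units = resp_a.shape[1]
    return np.array([np.corrcoef(aligned_a[:, :, u].ravel(),
                                 aligned_b[:, :, u].ravel())[0, 1]
                     for u in range(n_units)])

Sweeping this comparison over a range of segment durations and reading off, per unit, the shortest duration at which the cross-context correlation saturates corresponds to the integration-window estimate the abstract refers to.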