Department of Experimental Psychology, University of Cambridge.
Department of Engineering, University of Cambridge.
Trends Hear. 2020 Jan-Dec;24:2331216520943074. doi: 10.1177/2331216520943074.
The "time-varying loudness" (TVL) model of Glasberg and Moore calculates "instantaneous loudness" every 1 ms. This is used to generate predictions of short-term loudness (the loudness of a short segment of sound, such as a word in a sentence) and of long-term loudness (the loudness of a longer segment of sound, such as a whole sentence). The calculation of instantaneous loudness is computationally intensive, and real-time implementation of the TVL model is difficult. To speed up the computation, a deep neural network (DNN) was trained to predict instantaneous loudness using a large database of speech sounds and artificial sounds (tones alone and tones in white or pink noise), with the predictions of the TVL model as a reference (providing the "correct" answer, specifically the loudness level in phons). A multilayer perceptron with three hidden layers was found to be sufficient; more complex DNN architectures did not yield higher accuracy. After training, the deviations between the predictions of the TVL model and the predictions of the DNN were typically less than 0.5 phons, even for types of sounds that were not used for training (music, rain, animal sounds, and washing machine). The DNN calculates instantaneous loudness over 100 times more quickly than the TVL model. Possible applications of the DNN are discussed.
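As an illustration of the architecture described above, the following is a minimal sketch of a multilayer perceptron with three hidden layers that maps one frame of input features to a single loudness-level estimate. The abstract does not specify the input representation, layer widths, or activation function, so `N_FEATURES`, the hidden sizes of 16, the ReLU activation, and all function names are hypothetical choices made only for this example. The weights are random and untrained, so the output is not a meaningful loudness value. The helper `mean_abs_deviation` shows one way a DNN output could be compared with reference values in phons; it is not necessarily the metric used by the authors.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    """Element-wise rectifier (an assumed activation, not from the source)."""
    return [max(0.0, v) for v in x]


def build_mlp(n_inputs, hidden_sizes, rng):
    """Build dense layers: input -> hidden layers -> one scalar output."""
    sizes = [n_inputs, *hidden_sizes, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def predict(mlp, features):
    """Forward pass: ReLU after each hidden layer, linear output layer."""
    x = features
    for layer in mlp[:-1]:
        x = relu(dense(x, layer))
    return dense(x, mlp[-1])[0]


def mean_abs_deviation(predicted, reference):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


rng = random.Random(0)
N_FEATURES = 8  # hypothetical size of the per-frame input vector
mlp = build_mlp(N_FEATURES, (16, 16, 16), rng)  # three hidden layers

frame = [rng.random() for _ in range(N_FEATURES)]  # placeholder input frame
estimate = predict(mlp, frame)  # untrained, so the value is arbitrary
```

In a trained system, the reference values passed to `mean_abs_deviation` would be the TVL model's outputs for the same frames. The speed advantage reported in the abstract is plausible for such a design because the forward pass consists only of a few matrix-vector products per frame.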