Amiriparian Shahin, Hübner Tobias, Karas Vincent, Gerczuk Maurice, Ottl Sandra, Schuller Björn W
Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany.
Group on Language, Audio, and Music (GLAM), Imperial College London, London, United Kingdom.
Front Artif Intell. 2022 Mar 17;5:856232. doi: 10.3389/frai.2022.856232. eCollection 2022.
Deep neural speech and audio processing systems have a large number of trainable parameters, a relatively complex architecture, and require a vast amount of training data and computational power. These constraints make it challenging to integrate such systems into embedded devices and utilize them for real-time, real-world applications. We tackle these limitations by introducing DeepSpectrumLite, an open-source, lightweight transfer learning framework for on-device speech and audio recognition using pre-trained image Convolutional Neural Networks (CNNs). The framework creates and augments Mel spectrogram plots on the fly from raw audio signals, which are then used to fine-tune specific pre-trained CNNs for the target classification task. Subsequently, the whole pipeline can be run in real time, with a mean inference lag of 242.0 ms when a DenseNet121 model is used on a consumer-grade smartphone. DeepSpectrumLite operates in a decentralized manner, eliminating the need to upload data for further processing. We demonstrate the suitability of the proposed transfer learning approach for embedded audio signal processing by obtaining state-of-the-art results on a set of paralinguistic and general audio tasks, including speech and music emotion recognition, social signal processing, COVID-19 cough and COVID-19 speech analysis, and snore sound classification. We provide an extensive command-line interface for users and developers, which is comprehensively documented and publicly available at https://github.com/DeepSpectrum/DeepSpectrumLite.
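The first stage of the pipeline described above, turning raw audio into a log-Mel spectrogram that a pre-trained image CNN can consume, can be sketched in plain NumPy. This is a minimal illustration of the general technique, not DeepSpectrumLite's implementation; all parameter values (16 kHz sample rate, 512-point FFT, 160-sample hop, 64 mel bands) are illustrative assumptions rather than the framework's defaults.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style mel scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=64):
    # Frame the signal, apply a Hann window, take the power spectrum.
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(signal[s:s + n_fft] * window)) ** 2
              for s in range(0, len(signal) - n_fft + 1, hop)]
    power = np.array(frames).T                      # (n_fft//2 + 1, n_frames)
    mel = mel_filterbank(n_mels, n_fft, sr) @ power
    return 10.0 * np.log10(np.maximum(mel, 1e-10))  # dB scale

# A one-second 440 Hz tone stands in for a raw audio input.
sr = 16000
t = np.arange(sr) / sr
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(spec.shape)  # (n_mels, n_frames)
```

In the actual framework, the resulting 2-D array would be rendered as a color spectrogram plot, optionally augmented, and fed to a pre-trained image CNN such as DenseNet121 for fine-tuning on the target classification task.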