PGS Software, 50-086 Wrocław, Poland.
Institute of Data Science and Digital Technologies, Vilnius University, LT-08412 Vilnius, Lithuania.
Sensors (Basel). 2022 Dec 28;23(1):315. doi: 10.3390/s23010315.
Robust detection of Lombard speech in noise is challenging. This study proposes a strategy for detecting Lombard speech with a machine learning approach, aimed at applications such as public address systems that operate in near real time. The paper starts with background on the Lombard effect. The assumptions underlying the Lombard speech detection work are then outlined. The proposed framework combines convolutional neural networks (CNNs) with various two-dimensional (2D) speech signal representations. To reduce the computational cost without abandoning the 2D representation-based approach, a strategy for threshold-based averaging of the Lombard effect detection results is introduced. The pseudocode of the averaging process is also included. A series of experiments is performed to determine the most effective network structure and 2D speech signal representation. Investigations are carried out on German and Polish recordings containing Lombard speech. All 2D speech signal representations are tested with and without augmentation. Here, augmentation means using the alpha channel to store additional data: the gender of the speaker, the fundamental frequency (F0), and the first two mel-frequency cepstral coefficients (MFCCs). The experimental results show that Lombard and neutral speech recordings can be clearly distinguished with high detection accuracy. It is also demonstrated that the proposed speech detection process is capable of working in near real time. These are the key contributions of this work.
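The abstract does not reproduce the paper's pseudocode, so the following is only a minimal sketch of what threshold-based averaging of detection results could look like. It assumes the CNN emits one Lombard probability per analysis frame; the function name `smooth_lombard_decisions`, the `window` length, and the `threshold` value are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def smooth_lombard_decisions(
    frame_scores: Sequence[float],
    window: int = 3,        # assumed number of recent frames to average
    threshold: float = 0.5  # assumed decision threshold on the averaged score
) -> List[bool]:
    """Return one Lombard/neutral decision per frame.

    Each decision compares the mean of the most recent `window` scores
    (fewer at the start of the stream) with `threshold`.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    decisions: List[bool] = []
    running_sum = 0.0
    for i, score in enumerate(frame_scores):
        running_sum += score
        if i >= window:
            # Drop the score that just left the averaging window.
            running_sum -= frame_scores[i - window]
        count = min(i + 1, window)
        decisions.append(running_sum / count >= threshold)
    return decisions


if __name__ == "__main__":
    # Hypothetical per-frame CNN outputs, not data from the paper.
    scores = [0.9, 0.2, 0.8, 0.9, 0.1, 0.1]
    print(smooth_lombard_decisions(scores))
```

In this sketch a single low score surrounded by high ones does not flip the decision, which is the kind of stabilising behaviour one would expect averaging to provide. The running sum keeps the per-frame cost constant, which matters for a near real-time setting.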