Liu Shuo, Mallol-Ragolta Adria, Yan Tianhao, Qian Kun, Parada-Cabaleiro Emilia, Hu Bin, Schuller Bjorn W
IEEE J Biomed Health Inform. 2022 Aug;26(8):4291-4302. doi: 10.1109/JBHI.2022.3173128. Epub 2022 Aug 11.
The importance of detecting whether a person wears a face mask while speaking has increased tremendously since the outbreak of SARS-CoV-2 (COVID-19), as wearing a mask can help to reduce the spread of the virus and mitigate the public health crisis. Besides affecting frequency-related characteristics of human speech, face masks cause temporal interference in speech, altering its pace, rhythm, and pronunciation speed. In this regard, this paper presents two effective neural network models to detect surgical masks from audio. Both proposed architectures are based on Convolutional Neural Networks (CNNs), chosen as an optimal approach for the spatial processing of the audio signals. One architecture applies a Long Short-Term Memory (LSTM) network to model temporal dependencies; through an additional attention mechanism, this LSTM-based architecture enables the extraction of more salient temporal information. The other architecture (named ConvTx) retrieves the relative position of a sequence through the positional encoder of a Transformer module. To assess to what extent the two architectures can complement each other when modelling temporal dynamics, we also explore the combination of LSTM and Transformer modules in three hybrid models. Finally, we investigate whether data augmentation techniques, such as using transitions between audio frames, and gender-dependent frameworks affect the performance of the proposed architectures. Our experimental results show that one of the hybrid models achieves the best performance, surpassing existing state-of-the-art results for the task at hand.
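The two temporal mechanisms named in the abstract, a positional encoder and attention over time steps, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify either component, so the sinusoidal encoding (as in the original Transformer), the dot-product scoring, and the function names `sinusoidal_positional_encoding` and `attention_pool` are all assumptions made for illustration. The CNN and LSTM feature extractors are omitted; `frames` stands in for their per-frame outputs.

```python
import math


def sinusoidal_positional_encoding(num_frames, dim):
    """Return a num_frames x dim table of fixed sinusoidal position codes.

    Assumption: the standard Transformer encoding, not necessarily the
    one used in ConvTx.
    """
    table = []
    for pos in range(num_frames):
        row = []
        for i in range(dim):
            # Channels 2k and 2k+1 share one frequency (sin / cos pair).
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def attention_pool(frames, query):
    """Pool a sequence of frame vectors into one vector.

    Each frame is scored by its dot product with `query` (a hypothetical
    stand-in for a learned parameter); a softmax over the scores gives
    the weights of the weighted average.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    # Toy per-frame features: 4 frames, 6 channels, position codes only.
    frames = sinusoidal_positional_encoding(4, 6)
    pooled, weights = attention_pool(frames, query=[1.0] * 6)
    print(weights)
```

In a trained model the query would be learned, so that frames carrying more mask-related evidence receive larger weights; here it is fixed only to keep the sketch runnable.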