Lozano Andrés, Nava Enrique, García Méndez María Dolores, Moreno-Torres Ignacio
Department of Communication Engineering, University of Málaga, Málaga, Spain.
Department of Spanish Philology, University of Málaga, Málaga, Spain.
PLoS One. 2024 Dec 31;19(12):e0315452. doi: 10.1371/journal.pone.0315452. eCollection 2024.
Nasalance is a valuable clinical biomarker for hypernasality. It is conventionally computed as the ratio of the acoustic energy emitted through the nose to the total energy emitted through the mouth and nose (eNasalance). We propose a new approach that computes nasalance using Convolutional Neural Networks (CNNs) trained with Mel-Frequency Cepstral Coefficients (mfccNasalance). We evaluate the accuracy of mfccNasalance: 1) when the training and test data come from the same or from different dialects; 2) with test data that differ in dynamicity (e.g. rapidly produced diadochokinetic syllables versus short words); and 3) across multiple CNN configurations (i.e. kernel shape and use of a 1 × 1 pointwise convolution). Dual-channel Nasometer speech data were recorded from healthy speakers of different dialects: Costa Rica (more nasal, +nasal), and Spain and Chile (less nasal, -nasal). The inputs to the CNN models were sequences of 39 MFCC vectors computed from 250 ms moving windows. The test data were recorded in Spain and included short words (-dynamic), sentences (+dynamic), and diadochokinetic syllables (+dynamic). The accuracy of a CNN model was defined as the Spearman correlation between the mfccNasalance scores produced by that model and the perceptual nasality scores assigned by human experts. In the same-dialect condition, mfccNasalance was more accurate than eNasalance regardless of the CNN configuration; using a 1 × 1 kernel increased accuracy for +dynamic utterances (p < .001), though not for -dynamic utterances. Kernel shape had a significant impact only for -dynamic utterances (p < .001). In the different-dialect condition, the scores were significantly less accurate than in the same-dialect condition, particularly for models trained on Costa Rican data. We conclude that mfccNasalance is a flexible and useful alternative to eNasalance. Future studies should explore how to optimize mfccNasalance by selecting the most suitable CNN model as a function of the dynamicity of the target speech data.