Electrical Engineering, Indian Institute of Science, Bangalore-560012, India.
J Acoust Soc Am. 2018 Jun;143(6):3352. doi: 10.1121/1.5039750.
A transformation function (TF) that reconstructs neutral speech articulatory trajectories (NATs) from whispered speech articulatory trajectories (WATs) is investigated, such that the dynamic time warped (DTW) distance between the transformed whispered and the original neutral articulatory movements is minimized. Three candidate TFs are considered: an affine function with a diagonal matrix (A_d), which reconstructs one NAT from the corresponding WAT; an affine function with a full matrix (A_f); and a deep neural network (DNN) based nonlinear function. The latter two reconstruct each NAT from all WATs. Experiments reveal that the transformation is approximated well by A_f, since it generalizes better across subjects and achieves the lowest DTW distance, 5.20 (±1.27) mm on average, a relative improvement of 7.47%, 4.76%, and 7.64% over A_d, the DNN, and the best baseline scheme, respectively. Further analysis of the differences between neutral and whispered articulation reveals that the whispered articulators exhibit exaggerated movements in order to reconstruct the lip movements during neutral speech. It is also observed that, among the articulators considered in the study, the tongue exhibits higher precision and stability while whispering, implying that subjects control their tongue movements carefully in order to render intelligible whispered speech.
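The abstract does not say how the affine TFs are estimated, so the following is only a minimal sketch of the ideas it names, not the authors' procedure. It assumes trajectories are stored as NumPy arrays of shape (frames, articulator dimensions), fits the diagonal and full affine maps by ordinary least squares on DTW-aligned frames, and reports the DTW distance as the mean frame distance along the warping path. All function names (`dtw`, `fit_affine_diag`, `fit_affine_full`, `fit_on_alignment`) are hypothetical.

```python
import numpy as np

def dtw(x, y):
    """DTW between trajectories x (T1, D) and y (T2, D).

    Returns (mean frame distance along the warping path, path), where path
    is a list of (index into x, index into y) pairs. Normalising by path
    length is an assumption of this sketch.
    """
    t1, t2 = len(x), len(y)
    # Pairwise Euclidean distance between every frame of x and every frame of y.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the end of both sequences to recover the warping path.
    i, j, path = t1, t2, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return acc[t1, t2] / len(path), path

def fit_affine_diag(w, n):
    """Diagonal affine map: each neutral dimension from the same whispered dimension."""
    dims = w.shape[1]
    a, b = np.empty(dims), np.empty(dims)
    for d in range(dims):
        a[d], b[d] = np.polyfit(w[:, d], n[:, d], 1)  # slope, intercept
    return np.diag(a), b

def fit_affine_full(w, n):
    """Full affine map: each neutral dimension from all whispered dimensions."""
    w1 = np.hstack([w, np.ones((len(w), 1))])  # append a bias column
    sol, *_ = np.linalg.lstsq(w1, n, rcond=None)
    return sol[:-1].T, sol[-1]  # A (D x D), b (D,), so that n ≈ w @ A.T + b

def fit_on_alignment(w, n, fit):
    """Align w to n with DTW once, then fit the chosen affine map on aligned frames."""
    _, path = dtw(w, n)
    wi, ni = (list(idx) for idx in zip(*path))
    return fit(w[wi], n[ni])
```

Given a whispered trajectory `w` and a neutral trajectory `n`, one could compare the two affine candidates with `A, b = fit_on_alignment(w, n, fit_affine_full)` followed by `dtw(w @ A.T + b, n)[0]`, and likewise with `fit_affine_diag`. A single alignment pass is a simplification; alternating between alignment and fitting would be a natural refinement.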