Deriving Vocal Fold Oscillation Information from Recorded Voice Signals Using Models of Phonation.

Author information

Zhao Wayne, Singh Rita

Affiliations

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Publication information

Entropy (Basel). 2023 Jul 10;25(7):1039. doi: 10.3390/e25071039.

DOI:10.3390/e25071039

PMID:37509986

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10378572/

Abstract

During phonation, the vocal folds exhibit a self-sustained oscillatory motion, which is influenced by the physical properties of the speaker's vocal folds and driven by the balance of bio-mechanical and aerodynamic forces across the glottis. Subtle changes in the speaker's physical state can affect voice production and alter these oscillatory patterns. Measuring these can be valuable in developing computational tools that analyze voice to infer the speaker's state. Traditionally, vocal fold oscillations (VFOs) are measured directly using physical devices in clinical settings. In this paper, we propose a novel analysis-by-synthesis approach that allows us to infer the VFOs directly from recorded speech signals on an individualized, speaker-by-speaker basis. The approach, called the ADLES-VFT algorithm, is proposed in the context of a joint model that combines a phonation model (with a glottal flow waveform as the output) and a vocal tract acoustic wave propagation model such that the output of the joint model is an estimated waveform. The ADLES-VFT algorithm is a forward-backward algorithm which minimizes the error between the recorded waveform and the output of this joint model to estimate its parameters. Once estimated, these parameter values are used in conjunction with a phonation model to obtain its solutions. Since the parameters correlate with the physical properties of the vocal folds of the speaker, model solutions obtained using them represent the individualized VFOs for each speaker. The approach is flexible and can be applied to various phonation models. In addition to presenting the methodology, we show how the VFOs can be quantified from a dynamical systems perspective for classification purposes. Mathematical derivations are provided in an appendix for better readability.
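The analysis-by-synthesis idea in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: a van der Pol oscillator stands in for the phonation model, a one-pole low-pass filter stands in for the vocal tract propagation model, and a brute-force grid search replaces the forward-backward estimation used by ADLES-VFT. The names `simulate`, `tract_filter`, `joint_model`, and `estimate_alpha` are hypothetical.

```python
def simulate(alpha, n_steps=2000, dt=0.01, x0=0.1, v0=0.0):
    """Stand-in phonation model (assumption): van der Pol oscillator
    x'' - alpha * (1 - x^2) * x' + x = 0, integrated with semi-implicit Euler.
    Returns the displacement trajectory as a list of floats."""
    x, v = x0, v0
    trajectory = []
    for _ in range(n_steps):
        accel = alpha * (1.0 - x * x) * v - x
        v += dt * accel   # update velocity first
        x += dt * v       # then position, using the new velocity
        trajectory.append(x)
    return trajectory


def tract_filter(signal, pole=0.9):
    """Stand-in vocal tract model (assumption): one-pole low-pass filter."""
    y = 0.0
    filtered = []
    for s in signal:
        y = pole * y + (1.0 - pole) * s
        filtered.append(y)
    return filtered


def joint_model(alpha):
    """Joint model: oscillator output passed through the tract stand-in."""
    return tract_filter(simulate(alpha))


def squared_error(a, b):
    """Sum of squared differences between two equal-length waveforms."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def estimate_alpha(recorded, candidates):
    """Pick the candidate parameter whose synthesized waveform is closest
    to the recorded one (grid search, not the paper's forward-backward method)."""
    return min(candidates, key=lambda a: squared_error(joint_model(a), recorded))


# Synthetic "recording" generated with a known parameter, so the fit can be checked.
true_alpha = 0.8
recorded = joint_model(true_alpha)

candidates = [round(0.1 * k, 1) for k in range(1, 16)]  # 0.1, 0.2, ..., 1.5
alpha_hat = estimate_alpha(recorded, candidates)

# With the parameter estimated, rerun the phonation model alone to obtain
# the oscillation trajectory attributed to this "speaker".
vfo = simulate(alpha_hat)
```

Because the synthetic recording here is produced by the same joint model, the search recovers the generating parameter exactly. Real speech would require a more realistic phonation model, a vocal tract model matched to the speaker, and an efficient optimizer over several parameters.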
