Wang Dagen, Narayanan Shrikanth S
Viterbi School of Engineering, University of Southern California (USC), Los Angeles, CA 90007 USA. He is now with the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA.
IEEE Trans Audio Speech Lang Process. 2007 Nov 1;15(8):2190-2201. doi: 10.1109/TASL.2007.905178.
In this paper, we propose a direct method for speech rate estimation from acoustic features that requires no automatic speech transcription. We compare various spectral and temporal signal analysis and smoothing strategies to better characterize the underlying syllable structure from which speech rate is derived. The proposed algorithm extends the methods of spectral subband correlation by including temporal correlation and by using prominent spectral subbands to improve the signal correlation essential for syllable detection. Furthermore, to address some of the practical robustness issues in previously proposed methods, we introduce several novel components into the algorithm: pitch confidence for filtering spurious syllable envelope peaks, a magnifying window for tackling smearing between neighboring syllables, and relative peak measure thresholds for rejecting pseudo peaks. We also describe an automated approach for learning algorithm parameters from data, and find the optimal settings through Monte Carlo simulations and parameter sensitivity analysis. Final experimental evaluations are conducted on a portion of the Switchboard corpus for which both manual phonetic segmentation information and published results for direct comparison are available. The results show a correlation coefficient of 0.745 with respect to the ground truth based on manual segmentation. This result is about a 17% improvement compared to the current best single estimator and an 11% improvement over the multiestimator evaluated on the same Switchboard database.
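The general idea of counting syllable-like envelope peaks can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of four prominent subbands, the pairwise-product form of the subband correlation, and the 0.2 relative threshold are all assumptions made for illustration, and the pitch-confidence filtering, temporal correlation, and magnifying window described in the abstract are omitted.

```python
import numpy as np

def subband_correlation_envelope(energies, num_prominent=4):
    """Hypothetical envelope: per frame, keep the num_prominent largest
    subband energies and sum their pairwise products.

    energies: array of shape (frames, bands) with nonnegative values.
    """
    energies = np.asarray(energies, dtype=float)
    top = np.sort(energies, axis=1)[:, -num_prominent:]
    total = top.sum(axis=1)
    # Identity: sum_{i<j} x_i * x_j = ((sum x)^2 - sum x^2) / 2
    return (total ** 2 - (top ** 2).sum(axis=1)) / 2.0

def count_peaks(envelope, rel_height=0.2):
    """Count local maxima higher than rel_height * max(envelope).

    The relative threshold is an assumed stand-in for the paper's
    relative peak measure; it rejects small pseudo peaks.
    """
    env = np.asarray(envelope, dtype=float)
    if env.size < 3 or env.max() <= 0:
        return 0
    threshold = rel_height * env.max()
    interior = env[1:-1]
    is_peak = (
        (interior > env[:-2])      # rises from the previous frame
        & (interior >= env[2:])    # does not rise into the next frame
        & (interior > threshold)   # tall enough relative to the maximum
    )
    return int(is_peak.sum())

def estimate_speech_rate(energies, frame_rate_hz, num_prominent=4, rel_height=0.2):
    """Return the estimated syllables per second for one utterance."""
    energies = np.asarray(energies, dtype=float)
    envelope = subband_correlation_envelope(energies, num_prominent)
    duration_s = energies.shape[0] / frame_rate_hz
    return count_peaks(envelope, rel_height) / duration_s

# Synthetic check: 2 s of frames at 100 Hz with a 5 Hz syllable-like modulation.
t = np.arange(200) / 100.0
modulation = 0.5 * (1.0 - np.cos(2.0 * np.pi * 5.0 * t))
band_weights = np.array([1.0, 0.8, 0.6, 0.4, 0.2, 0.1])
synthetic_energies = modulation[:, None] * band_weights[None, :]
rate = estimate_speech_rate(synthetic_energies, frame_rate_hz=100.0)
print(f"estimated rate: {rate:.2f} syllables/s")
```

In an evaluation like the one reported here, per-utterance estimates of this kind would be compared with rates derived from manual segmentation, for example with the Pearson correlation coefficient computed by `np.corrcoef`.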