Dagen Wang, Shrikanth Narayanan
The authors are with the Viterbi School of Engineering, University of Southern California (USC), Los Angeles, CA 90007 USA.
IEEE Trans Audio Speech Lang Process. 2007 Feb 1;15(2):690-701. doi: 10.1109/tasl.2006.881703.
An algorithm for automatic speech prominence detection is reported in this paper. We describe a comparative analysis of various acoustic features for word prominence detection and report results on a spoken dialog corpus with manually assigned prominence labels. The focus is on features, such as spectral intensity and speech rate, that are extracted directly from speech using a correlation-based approach without requiring explicit linguistic or phonetic knowledge. Additionally, various pitch-based measures are studied with respect to their discriminating ability for prominence detection. A parametric scheme for modeling pitch plateaus is proposed, and this feature alone is found to outperform traditional local pitch statistics. Two sets of experiments are used to explore the usefulness of the acoustic score generated from these features. The first set focuses on the more traditional approach of word prominence detection based on a manually tagged corpus; a 76.8% classification accuracy was achieved on a corpus of role-playing spoken dialogs. Because of the difficulty of manually tagging speech prominence into discrete levels (categories), the second set of experiments evaluates the score indirectly. Specifically, through experiments on the Switchboard corpus, it is shown that the proposed acoustic score can discriminate between content words and function words in a statistically significant way. The relation between speech prominence and content/function words is also explored. Since prominent words tend to be predominantly content words, and since content words can be marked automatically from text-derived part-of-speech (POS) information, it is shown that the proposed acoustic score can be indirectly cross-validated through POS information.
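To give a concrete picture of what a correlation-based, knowledge-free acoustic feature extractor of this kind might look like, the sketch below estimates a rough speech rate from sub-band energy envelopes using only NumPy. The sub-band layout, frame sizes, and peak-picking threshold are illustrative assumptions for this sketch, not the authors' exact procedure or parameter choices.

```python
# Minimal sketch (assumptions marked): correlation of sub-band energy envelopes
# as a rough, knowledge-free cue to syllable nuclei, from which a speech-rate
# estimate is derived. Not the paper's exact algorithm; an illustration only.
import numpy as np

def subband_envelopes(signal, sr, n_bands=4, frame_len=0.025, hop=0.010):
    """Short-time energy envelopes in n_bands equal-width frequency bands."""
    frame = int(frame_len * sr)
    step = int(hop * sr)
    n_frames = max(1, 1 + (len(signal) - frame) // step)
    window = np.hanning(frame)
    envs = np.zeros((n_bands, n_frames))
    edges = np.linspace(0, frame // 2 + 1, n_bands + 1, dtype=int)
    for t in range(n_frames):
        chunk = signal[t * step : t * step + frame] * window
        spec = np.abs(np.fft.rfft(chunk)) ** 2
        for b in range(n_bands):
            envs[b, t] = spec[edges[b]:edges[b + 1]].sum()
    return envs

def correlation_envelope(envs, win=15):
    """Average pairwise correlation of sub-band envelopes in a sliding window,
    weighted by local energy; high values suggest coherent energy pulses."""
    n_bands, n_frames = envs.shape
    out = np.zeros(n_frames)
    half = win // 2
    for t in range(n_frames):
        seg = envs[:, max(0, t - half): t + half + 1]
        seg = seg - seg.mean(axis=1, keepdims=True)
        norm = np.linalg.norm(seg, axis=1) + 1e-9
        corr = (seg @ seg.T) / np.outer(norm, norm)
        off_diag_mean = (corr.sum() - n_bands) / (n_bands * (n_bands - 1))
        out[t] = off_diag_mean * envs[:, t].mean()
    return out

def estimate_speech_rate(signal, sr):
    """Rough syllable-rate estimate: count peaks of the correlation envelope."""
    env = correlation_envelope(subband_envelopes(signal, sr))
    env = env / (env.max() + 1e-9)
    # Simple fixed-threshold peak picking (an assumption for illustration).
    peaks = [t for t in range(1, len(env) - 1)
             if env[t] > 0.3 and env[t] >= env[t - 1] and env[t] > env[t + 1]]
    return len(peaks) / (len(signal) / sr)  # pulses per second

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr * 2) / sr
    # Toy "speech": a 200-Hz tone amplitude-modulated at ~4 Hz (syllable-like pulses).
    toy = np.sin(2 * np.pi * 200 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t)) ** 2
    print("estimated rate (pulses/s):", round(estimate_speech_rate(toy, sr), 2))
```

The design point this sketch illustrates is that the cue is computed purely from the waveform: no phone segmentation, lexicon, or transcript is required, which is the sense in which the abstract calls the features free of explicit linguistic or phonetic knowledge.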