Center for Robust Speech Systems (CRSS), The University of Texas at Dallas, 800 W. Campbell Road, Richardson, Texas 75080, USA.
J Acoust Soc Am. 2012 Feb;131(2):1515-28. doi: 10.1121/1.3672707.
This study addresses the problem of sparse enrollment data for in-set versus out-of-set speaker recognition. The challenge is that both the training data for each speaker (5 s) and the test material (2-6 s) are of limited duration. The limited enrollment data result in a sparse acoustic model space for the desired speaker model. The focus of this study is on filling these acoustic holes by harvesting neighbor speaker information to improve overall system performance. Acoustically similar speakers are selected from a separate available corpus via three different methods for speaker similarity measurement. The data selected from these acoustically similar speakers are used to compensate for the lack of phone coverage caused by the original sparse enrollment data. The proposed speaker modeling process mimics the naturally distributed acoustic space for conversational speech. The Gaussian mixture model (GMM) tagging process allows simulated natural conversational speech to be included for in-set speaker modeling, which maintains the original system requirement of text-independent speaker recognition. A human listener evaluation is also performed to compare machine versus human speaker recognition performance, with machine accuracy of 95% compared to 72.2% for human listeners on the in-set/out-of-set task. Results show that for extremely sparse train/reference audio streams, human speaker recognition is not nearly as reliable as machine-based speaker recognition. The proposed acoustic hole filling solution (MRNC) produces an average 7.42% relative improvement over a GMM-Cohort UBM baseline and a 19% relative improvement over the Eigenvoice baseline using the FISHER corpus.
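The idea of harvesting neighbor speakers to fill acoustic holes can be illustrated with a minimal sketch. This is not the MRNC method itself: the abstract does not name its three similarity measures, so the sketch assumes one simple stand-in (average log-likelihood of the enrollment frames under a single diagonal Gaussian fitted to each candidate speaker). All function names, speaker ids, and the synthetic features are hypothetical.

```python
import numpy as np

def fit_diag_gaussian(frames):
    """Fit a single diagonal-covariance Gaussian to (n_frames, n_dims) features."""
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6  # variance floor avoids division by zero
    return mean, var

def avg_log_likelihood(frames, mean, var):
    """Average per-frame log-likelihood under a diagonal Gaussian."""
    diff = frames - mean
    per_frame = -0.5 * (np.log(2.0 * np.pi * var) + diff ** 2 / var).sum(axis=1)
    return float(per_frame.mean())

def select_neighbors(enroll_frames, candidates, k):
    """Return ids of the k candidates whose models best explain the enrollment frames.

    Assumed similarity measure for illustration only; the paper's three
    measures are not specified in the abstract.
    """
    scores = {
        spk: avg_log_likelihood(enroll_frames, *fit_diag_gaussian(frames))
        for spk, frames in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

def augment_enrollment(enroll_frames, candidates, neighbor_ids):
    """Pool the sparse enrollment frames with the selected neighbors' frames."""
    return np.vstack([enroll_frames] + [candidates[s] for s in neighbor_ids])

# Synthetic stand-ins for acoustic feature frames (hypothetical data).
rng = np.random.default_rng(0)
dim = 4
enroll = rng.normal(loc=0.0, size=(50, dim))          # sparse enrollment data
candidates = {
    "near_a": rng.normal(loc=0.1, size=(200, dim)),   # acoustically close
    "near_b": rng.normal(loc=-0.1, size=(200, dim)),  # acoustically close
    "far_c": rng.normal(loc=5.0, size=(200, dim)),    # acoustically distant
}

neighbors = select_neighbors(enroll, candidates, k=2)
pooled = augment_enrollment(enroll, candidates, neighbors)
```

In this toy setup the pooled frames would then be used to train the in-set speaker model in place of the sparse enrollment data alone. A real system would use GMMs over spectral features and a held-out corpus rather than single Gaussians over synthetic data.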