Tsai Wei-Ho, Wang Hsin-Min
Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan.
J Acoust Soc Am. 2006 Sep;120(3):1631-45. doi: 10.1121/1.2225570.
This paper investigates how to partition unknown speech utterances into clusters such that each cluster contains utterances from only one speaker and the number of clusters reflects the unknown speaker population size. The proposed method begins by specifying a number of clusters, corresponding to one possible speaker population size, and then maximizes the overall within-cluster homogeneity of the speakers' voice characteristics. Within-cluster homogeneity is characterized by the likelihood that a cluster model, trained on all the utterances within a cluster, matches each of those utterances. To maximize the sum of likelihoods over all utterances, the method applies a genetic algorithm to determine the cluster to which each utterance should be assigned. For greater computational efficiency, a second clustering criterion is also proposed that approximates the likelihood with a divergence-based model similarity between a cluster and each of its utterances. The method then examines the admissible numbers of clusters, using an adapted Bayesian information criterion to determine the most likely speaker population size. Experimental results show that the proposed method outperforms conventional methods based on hierarchical clustering.
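The pipeline described in the abstract (fix a cluster count, search over utterance-to-cluster assignments with a genetic algorithm to maximize within-cluster likelihood, then pick the cluster count with a BIC-style penalty) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes scalar features, a single Gaussian per cluster in place of the paper's cluster models, truncation selection with one-point crossover, and a generic BIC penalty of two parameters per cluster. All function names and parameters (`homogeneity`, `genetic_cluster`, `select_num_speakers`, `penalty_weight`, and so on) are hypothetical, and the divergence-based approximation is not covered.

```python
import math
import random

def gaussian_loglik(frames, mean, var):
    """Log-likelihood of scalar frames under a 1-D Gaussian (toy cluster model)."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in frames)

def homogeneity(utterances, assignment, k, var_floor=1e-3):
    """Sum, over all utterances, of the log-likelihood under their own cluster's model."""
    total = 0.0
    for c in range(k):
        members = [u for u, a in zip(utterances, assignment) if a == c]
        if not members:
            continue  # an empty cluster contributes nothing
        pooled = [x for u in members for x in u]
        mean = sum(pooled) / len(pooled)
        var = max(sum((x - mean) ** 2 for x in pooled) / len(pooled), var_floor)
        total += sum(gaussian_loglik(u, mean, var) for u in members)
    return total

def genetic_cluster(utterances, k, pop_size=30, generations=60, p_mut=0.1, seed=0):
    """Search assignments with a simple GA; assumes at least two utterances."""
    rng = random.Random(seed)
    n = len(utterances)

    def fitness(a):
        return homogeneity(utterances, a, k)

    # Seed with the trivial one-cluster assignment, then random assignments.
    pop = [[0] * n] + [[rng.randrange(k) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]  # truncation selection; the best always survives
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = [rng.randrange(k) if rng.random() < p_mut else g
                     for g in p1[:cut] + p2[cut:]]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)

def select_num_speakers(utterances, max_k, penalty_weight=1.0):
    """Pick the cluster count with the highest BIC-style penalized score (assumed form)."""
    n_frames = sum(len(u) for u in utterances)
    best_k, best_bic = None, -math.inf
    for k in range(1, max_k + 1):
        _, score = genetic_cluster(utterances, k)
        n_params = 2 * k  # one mean and one variance per cluster
        bic = score - 0.5 * penalty_weight * n_params * math.log(n_frames)
        if bic > best_bic:
            best_k, best_bic = k, bic
    return best_k, best_bic

# Toy data: three short "utterances" from each of two well-separated speakers.
utterances = [
    [0.0, 0.2, -0.1, 0.1], [0.1, -0.2, 0.0, 0.2], [-0.1, 0.1, 0.2, 0.0],
    [5.0, 5.2, 4.9, 5.1], [5.1, 4.8, 5.0, 5.2], [4.9, 5.1, 5.2, 5.0],
]
true_partition = [0, 0, 0, 1, 1, 1]
```

Calling `genetic_cluster(utterances, 2)` returns an assignment and its score, and `select_num_speakers(utterances, 3)` returns the selected cluster count with its penalized score. Because the best individual is always carried over, the returned score can never fall below that of the seeded one-cluster assignment.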