Mallouh Arafat Abu, Qawaqneh Zakariya, Barkana Buket D
1 Computer Science and Engineering Department, University of Bridgeport, Bridgeport, CT 06604, USA.
2 Electrical Engineering Department, University of Bridgeport, Bridgeport, CT 06604, USA.
Neural Comput Appl. 2018;30(8):2581-2593. doi: 10.1007/s00521-017-2848-4. Epub 2017 Jan 17.
Speaker age and gender classification is one of the most challenging problems in speech signal processing. As technology has developed, identifying a speaker's age and gender has become a necessity for speaker verification and identification systems, with applications such as identifying suspects in criminal cases, improving human-machine interaction, and adapting music for people waiting in a queue. Despite intensive study aimed at extracting descriptive and distinctive features, classification accuracies are still not satisfactory. In this work, a model for generating bottleneck features from a deep neural network and a Gaussian Mixture Model-Universal Background Model (GMM-UBM) classifier are proposed for the speaker age and gender classification problem. A deep neural network with a bottleneck layer is first trained in an unsupervised manner to compute the initial weights between layers. It is then trained and tuned in a supervised manner to generate transformed mel-frequency cepstral coefficients (T-MFCCs). The GMM-UBM is used to build a GMM for each class, and these models are used to classify speaker age and gender. The age-annotated database of German telephone speech (aGender) is used to evaluate the proposed classification system. Used with the GMM-UBM classifier, the newly generated T-MFCCs show potential for significant improvements in speaker age and gender classification. The proposed classification system achieved an overall accuracy of 57.63%. The highest per-class accuracy, 72.97%, was obtained for adult female speakers.
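The GMM-UBM step described in the abstract (one GMM per class, then classification with those models) can be illustrated with a minimal sketch. The abstract does not say how the class models are derived from the UBM or how an utterance is scored, so the sketch assumes a common recipe: mean-only MAP adaptation with a relevance factor, and a decision by highest average frame log-likelihood. The function names, class labels, relevance value, and the toy two-dimensional vectors standing in for T-MFCC frames are all hypothetical and are not taken from the paper.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def component_scores(x, gmm):
    """log(w_k * N(x | mu_k, var_k)) for every mixture component k."""
    return [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(gmm["weights"], gmm["means"], gmm["vars"])]

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of an utterance under a GMM."""
    return sum(logsumexp(component_scores(x, gmm)) for x in frames) / len(frames)

def map_adapt_means(ubm, frames, relevance):
    """Assumed recipe: adapt only the UBM means toward one class's frames."""
    k, d = len(ubm["weights"]), len(ubm["means"][0])
    counts = [0.0] * k                      # soft frame counts per component
    sums = [[0.0] * d for _ in range(k)]    # responsibility-weighted frame sums
    for x in frames:
        scores = component_scores(x, ubm)
        norm = logsumexp(scores)
        for j in range(k):
            resp = math.exp(scores[j] - norm)
            counts[j] += resp
            for i in range(d):
                sums[j][i] += resp * x[i]
    # Interpolate between the class statistics and the UBM mean; equivalent to
    # alpha * E[x] + (1 - alpha) * mu_ubm with alpha = n / (n + relevance).
    means = [[(sums[j][i] + relevance * ubm["means"][j][i]) / (counts[j] + relevance)
              for i in range(d)] for j in range(k)]
    return {"weights": ubm["weights"], "means": means, "vars": ubm["vars"]}

def classify(frames, models):
    """Return the class whose adapted GMM gives the highest average log-likelihood."""
    return max(models, key=lambda name: avg_log_likelihood(frames, models[name]))

# Hypothetical two-component UBM over toy 2-D features.
ubm = {"weights": [0.5, 0.5],
       "means": [[0.0, 0.0], [4.0, 4.0]],
       "vars": [[1.0, 1.0], [1.0, 1.0]]}

# Hypothetical training frames for two illustrative classes.
class_frames = {
    "adult_female": [[-1.0, 0.2], [-0.8, -0.1], [-1.2, 0.1], [-0.9, 0.0]],
    "adult_male": [[5.0, 4.1], [4.8, 3.9], [5.2, 4.0], [5.1, 4.2]],
}
models = {name: map_adapt_means(ubm, frames, relevance=2.0)
          for name, frames in class_frames.items()}

print(classify([[-1.1, 0.0], [-0.7, 0.1]], models))
print(classify([[4.9, 4.0]], models))
```

In a real system the UBM would itself be trained on pooled data from all classes, and the frames would be the T-MFCC vectors produced by the bottleneck network rather than hand-written points.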