Tao Jidong, Johnson Michael T, Osiejuk Tomasz S
Speech and Signal Processing Laboratory, Marquette University, PO Box 1881, Milwaukee, Wisconsin 53233-1881, USA.
J Acoust Soc Am. 2008 Mar;123(3):1582-90. doi: 10.1121/1.2837487.
Automatic vocalization classification systems often require fairly large amounts of training data. However, collecting and transcribing animal vocalization data is difficult and time-consuming, which makes large data sets expensive to create. One natural solution to this problem is the use of acoustic adaptation methods. Such methods, common in human speech recognition systems, create initial models trained on speaker-independent data, then use small amounts of adaptation data to build individual-specific models. Because individual vocal variability is a significant source of variation in bioacoustic data, as it is in human speech, acoustic model adaptation is naturally suited to classification in this domain as well. To demonstrate and evaluate the effectiveness of this approach, this paper applies maximum likelihood linear regression (MLLR) adaptation to ortolan bunting (Emberiza hortulana L.) song-type classification. Classification accuracies for the adapted system are computed as a function of the amount of adaptation data and compared with those of caller-independent and caller-dependent systems. The experimental results indicate that, given the same amount of data, supervised adaptation significantly outperforms both caller-independent and caller-dependent systems.
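The abstract does not give the adaptation equations, so the following is only a minimal sketch of the standard textbook form of maximum likelihood linear regression for Gaussian means, not the authors' implementation. It assumes a single global transform, diagonal covariances, and state-occupation probabilities that have already been computed from the adaptation data. The function names (`estimate_mllr_mean_transform`, `adapt_means`) and array layouts are hypothetical. Each caller-independent mean mu is mapped to an adapted mean A mu + b, with W = [b, A] estimated one row at a time.

```python
import numpy as np


def estimate_mllr_mean_transform(obs, gamma, means, variances):
    """Estimate one global MLLR mean transform W = [b, A].

    obs:       (T, D) adaptation feature frames
    gamma:     (T, M) occupation probability of each Gaussian at each frame
    means:     (M, D) caller-independent Gaussian means
    variances: (M, D) diagonal covariances of the same Gaussians
    Returns W with shape (D, D + 1).
    """
    n_gauss, dim = means.shape
    # Extended mean vectors xi_m = [1, mu_m]
    xi = np.hstack([np.ones((n_gauss, 1)), means])
    occ = gamma.sum(axis=0)      # (M,) total occupancy per Gaussian
    first_order = gamma.T @ obs  # (M, D) occupancy-weighted frame sums

    W = np.zeros((dim, dim + 1))
    for i in range(dim):
        inv_var = 1.0 / variances[:, i]
        # G_i = sum_m occ_m / var_{m,i} * xi_m xi_m^T
        G = (xi * (occ * inv_var)[:, None]).T @ xi
        # k_i = sum_m (sum_t gamma_m(t) o_{t,i}) / var_{m,i} * xi_m
        k = xi.T @ (first_order[:, i] * inv_var)
        # Least squares tolerates a rank-deficient G when data are scarce
        W[i] = np.linalg.lstsq(G, k, rcond=None)[0]
    return W


def adapt_means(means, W):
    """Apply the transform: adapted mean = A @ mu + b for every Gaussian."""
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ W.T
```

Because one transform is shared by all Gaussians, a few adaptation frames can shift every mean in the model, including Gaussians that never occur in the adaptation data. That sharing is what lets adaptation work with small amounts of individual-specific data.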