Li Jialu, Hasegawa-Johnson Mark, McElwain Nancy L
Beckman Institute, University of Illinois, Urbana, IL 61801, USA.
Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL 61801, USA.
Speech Commun. 2021 Oct;133:41-61. doi: 10.1016/j.specom.2021.07.010. Epub 2021 Aug 18.
Classification of infant and parent vocalizations, particularly emotional vocalizations, is critical to understanding how infants learn to regulate emotions in social dyadic processes. This work is an experimental study of classifiers, features, and data augmentation strategies applied to the task of classifying infant and parent vocalization types. Our data were recorded both in the home and in the laboratory. Infant vocalizations were manually labeled as cry, fus (fuss), lau (laugh), bab (babble), or scr (screech), while parent (mostly mother) vocalizations were labeled as ids (infant-directed speech), ads (adult-directed speech), pla (playful), rhy (rhythmic speech or singing), lau (laugh), or whi (whisper). Linear discriminant analysis (LDA) was selected as the baseline classifier because it gave the highest accuracy in a previously published study covering part of this corpus. LDA was compared to two neural network architectures: a two-layer fully-connected network (FCN) and a convolutional neural network with self-attention (CNSA). Baseline features extracted using the OpenSMILE toolkit were augmented with additional voice quality, phonetic, and prosodic features, each targeting perceptual characteristics of one or more of the vocalization types. Three data augmentation and transfer learning methods were tested: pre-training of network weights on a related task (adult emotion classification), augmentation of under-represented classes using data uniformly sampled from other corpora, and augmentation of under-represented classes using data selected by a minimum cross-corpus information difference criterion. Feature selection using Fisher scores and the use of weighted versus unweighted samplers were also tested. Two datasets were evaluated: a benchmark dataset (CRIED) and our own corpus. On the CRIED dataset, the CNSA achieved a higher unweighted average recall (UAR) than previously published studies. In terms of classification accuracy, weighted F1, and macro F1 on our own dataset, both neural networks significantly outperformed LDA; the FCN slightly (but not significantly) outperformed the CNSA. Cross-examining the features selected by different feature selection algorithms permits a type of post-hoc feature analysis, in which the most important acoustic features for each binary type discrimination are listed. Examples of each vocalization type were selected using the overlapping features; their spectrograms are presented and discussed with respect to the type-discriminative acoustic features selected by the various algorithms. MFCCs, log Mel-frequency band energies, LSP frequencies, and the first formant (F1) are found to be the most important spectral-envelope features; the fundamental frequency (F0) is found to be the most important prosodic feature.
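The abstract mentions feature selection using Fisher scores over the OpenSMILE-derived acoustic features. The sketch below is a minimal, generic illustration of that technique, not the authors' implementation: each feature is scored by the ratio of between-class to within-class variance, and the top-scoring features are retained. The array shapes, function names, and the random demo data are assumptions for illustration only.

```python
# Minimal sketch of Fisher-score feature selection (not the paper's code).
import numpy as np

def fisher_scores(X, y):
    """X: (n_samples, n_features) acoustic features; y: integer or string class labels.
    Returns one score per feature: between-class scatter / within-class scatter."""
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = Xc.shape[0]
        num += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        den += n_c * Xc.var(axis=0)
    return num / np.maximum(den, 1e-12)  # guard against zero within-class variance

def select_top_k(X, y, k):
    """Indices of the k highest-scoring features."""
    return np.argsort(fisher_scores(X, y))[::-1][:k]

if __name__ == "__main__":
    # Hypothetical usage: X could hold OpenSMILE functionals, y the vocalization labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))
    y = rng.integers(0, 5, size=200)
    print(select_top_k(X, y, k=10))
```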
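The evaluation metrics named in the abstract (UAR, weighted F1, macro F1) correspond to standard scikit-learn averages: UAR is macro-averaged recall. The snippet below is a small sketch of that mapping under those assumptions; the toy labels are purely illustrative.

```python
# Minimal sketch of the reported metrics using scikit-learn (not the paper's code).
from sklearn.metrics import recall_score, f1_score

def report_metrics(y_true, y_pred):
    return {
        "UAR": recall_score(y_true, y_pred, average="macro"),       # unweighted average recall
        "macro_F1": f1_score(y_true, y_pred, average="macro"),
        "weighted_F1": f1_score(y_true, y_pred, average="weighted"),
    }

if __name__ == "__main__":
    # Hypothetical predictions over the five infant classes (cry, fus, lau, bab, scr).
    y_true = ["cry", "fus", "lau", "bab", "scr", "cry", "fus"]
    y_pred = ["cry", "cry", "lau", "bab", "scr", "fus", "fus"]
    print(report_metrics(y_true, y_pred))
```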