Li Jialu, Hasegawa-Johnson Mark, McElwain Nancy L
Beckman Institute, University of Illinois, Urbana, IL 61801, USA.
Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL 61801, USA.
Speech Commun. 2021 Oct;133:41-61. doi: 10.1016/j.specom.2021.07.010. Epub 2021 Aug 18.
Classification of infant and parent vocalizations, particularly emotional vocalizations, is critical to understanding how infants learn to regulate emotions in social dyadic processes. This work is an experimental study of classifiers, features, and data augmentation strategies applied to the task of classifying infant and parent vocalization types. Our data were recorded both in the home and in the laboratory. Infant vocalizations were manually labeled as cry, fus (fuss), lau (laugh), bab (babble), or scr (screech), while parent (mostly mother) vocalizations were labeled as ids (infant-directed speech), ads (adult-directed speech), pla (playful), rhy (rhythmic speech or singing), lau (laugh), or whi (whisper). Linear discriminant analysis (LDA) was selected as the baseline classifier because it gave the highest accuracy in a previously published study covering part of this corpus. LDA was compared to two neural network architectures: a two-layer fully-connected network (FCN) and a convolutional neural network with self-attention (CNSA). Baseline features extracted using the OpenSMILE toolkit were augmented with additional voice quality, phonetic, and prosodic features, each targeting perceptual characteristics of one or more of the vocalization types. Three data augmentation and transfer learning methods were tested: pre-training of network weights on a related task (adult emotion classification), augmentation of under-represented classes using data uniformly sampled from other corpora, and augmentation of under-represented classes using data selected by a minimum cross-corpus information difference criterion. Feature selection using Fisher scores and the use of weighted versus unweighted samplers were also tested. Two datasets were evaluated: a benchmark dataset (CRIED) and our own corpus. On the CRIED dataset, the CNSA achieved a higher unweighted average recall (UAR) than previously published studies. In terms of classification accuracy, weighted F1, and macro F1 on our own dataset, both neural networks significantly outperformed LDA; the FCN slightly (but not significantly) outperformed the CNSA. Cross-examining the features selected by different feature selection algorithms permits a type of post-hoc feature analysis, in which the most important acoustic features for each binary type discrimination are listed. Examples of each vocalization type were selected using the overlapping features; their spectrograms are presented and discussed with respect to the type-discriminative acoustic features selected by the various algorithms. MFCCs, log Mel-frequency band energies, LSP frequencies, and the first formant (F1) are found to be the most important spectral-envelope features; the fundamental frequency (F0) is found to be the most important prosodic feature.
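The abstract mentions feature selection using Fisher scores over the OpenSMILE-derived acoustic features. The sketch below is a minimal, generic illustration of that technique, not the authors' implementation: each feature is scored by the ratio of between-class to within-class variance, and the top-scoring features are retained. The array shapes, function names, and the random demo data are assumptions for illustration only.

```python
# Minimal sketch of Fisher-score feature selection (not the paper's code).
import numpy as np

def fisher_scores(X, y):
    """X: (n_samples, n_features) acoustic features; y: integer or string class labels.
    Returns one score per feature: between-class scatter / within-class scatter."""
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = Xc.shape[0]
        num += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        den += n_c * Xc.var(axis=0)
    return num / np.maximum(den, 1e-12)  # guard against zero within-class variance

def select_top_k(X, y, k):
    """Indices of the k highest-scoring features."""
    return np.argsort(fisher_scores(X, y))[::-1][:k]

if __name__ == "__main__":
    # Hypothetical usage: X could hold OpenSMILE functionals, y the vocalization labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))
    y = rng.integers(0, 5, size=200)
    print(select_top_k(X, y, k=10))
```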
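The evaluation metrics named in the abstract (UAR, weighted F1, macro F1) correspond to standard scikit-learn averages: UAR is macro-averaged recall. The snippet below is a small sketch of that mapping under those assumptions; the toy labels are purely illustrative.

```python
# Minimal sketch of the reported metrics using scikit-learn (not the paper's code).
from sklearn.metrics import recall_score, f1_score

def report_metrics(y_true, y_pred):
    return {
        "UAR": recall_score(y_true, y_pred, average="macro"),       # unweighted average recall
        "macro_F1": f1_score(y_true, y_pred, average="macro"),
        "weighted_F1": f1_score(y_true, y_pred, average="weighted"),
    }

if __name__ == "__main__":
    # Hypothetical predictions over the five infant classes (cry, fus, lau, bab, scr).
    y_true = ["cry", "fus", "lau", "bab", "scr", "cry", "fus"]
    y_pred = ["cry", "cry", "lau", "bab", "scr", "fus", "fus"]
    print(report_metrics(y_true, y_pred))
```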