

Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations.

Authors

Li Jialu, Hasegawa-Johnson Mark, McElwain Nancy L

Affiliations

Beckman Institute, University of Illinois, Urbana, IL 61801, USA.

Department of Electrical and Computer Engineering, USA.

Publication

Speech Commun. 2021 Oct;133:41-61. doi: 10.1016/j.specom.2021.07.010. Epub 2021 Aug 18.

DOI: 10.1016/j.specom.2021.07.010
PMID: 36062214
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9435967/
Abstract

Classification of infant and parent vocalizations, particularly emotional vocalizations, is critical to understanding how infants learn to regulate emotions in social dyadic processes. This work is an experimental study of classifiers, features, and data augmentation strategies applied to the task of classifying infant and parent vocalization types. Our data were recorded both in the home and in the laboratory. Infant vocalizations were manually labeled as cry, fus (fuss), lau (laugh), bab (babble), or scr (screech), while parent (mostly mother) vocalizations were labeled as ids (infant-directed speech), ads (adult-directed speech), pla (playful), rhy (rhythmic speech or singing), lau (laugh), or whi (whisper). Linear discriminant analysis (LDA) was selected as the baseline classifier because it gave the highest accuracy in a previously published study covering part of this corpus. LDA was compared to two neural network architectures: a two-layer fully-connected network (FCN) and a convolutional neural network with self-attention (CNSA). Baseline features extracted using the OpenSMILE toolkit were augmented with extra voice quality, phonetic, and prosodic features, each targeting perceptual properties of one or more of the vocalization types. Three data augmentation and transfer learning methods were tested: pre-training of network weights on a related task (adult emotion classification), augmentation of under-represented classes using data uniformly sampled from other corpora, and augmentation of under-represented classes using data selected by a minimum cross-corpus information difference criterion. Feature selection using Fisher scores, and experiments with weighted and unweighted samplers, were also conducted. Two datasets were evaluated: a benchmark dataset (CRIED) and our own corpus. On the CRIED dataset, the CNSA achieved the best unweighted-average recall (UAR) compared with previous studies. In terms of classification accuracy, weighted F1, and macro F1 on our own dataset, both neural networks significantly outperformed LDA; the FCN slightly (but not significantly) outperformed the CNSA. Cross-examining the features selected by different feature selection algorithms permits a type of post-hoc feature analysis, in which the most important acoustic features for each binary type discrimination are listed. Examples of each vocalization type were selected using the overlapping features; their spectrograms are presented and discussed with respect to the type-discriminative acoustic features selected by the various algorithms. MFCC, log Mel frequency band energy, LSP frequency, and F1 (first formant) are found to be the most important spectral envelope features; F0 is found to be the most important prosodic feature.
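For readers who want a concrete picture of the baseline setup described above, the following is a minimal sketch, not the authors' code: utterance-level openSMILE functionals as features, an LDA classifier, and unweighted-average recall (UAR, i.e., macro-averaged recall) as the metric. It assumes the audEERING `opensmile` Python package and scikit-learn, uses the ComParE_2016 feature set as a stand-in for the paper's exact openSMILE configuration, and the wav paths and labels are placeholders.

# Minimal sketch of the baseline pipeline from the abstract:
# openSMILE functionals -> LDA -> unweighted-average recall (UAR).
import numpy as np
import opensmile
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# ComParE_2016 functionals yield ~6,373 utterance-level acoustic features,
# covering MFCCs, log Mel band energies, LSP frequencies, and F0 statistics.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

wav_paths = ["clip_0001.wav", "clip_0002.wav"]  # placeholder recordings
labels = ["cry", "bab"]                         # placeholder vocalization labels

X = np.vstack([smile.process_file(p).to_numpy() for p in wav_paths])
y = np.array(labels)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)
clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

# UAR is macro-averaged recall: every class counts equally regardless of its
# frequency, which matters for rare classes such as screech or whisper.
uar = recall_score(y_te, clf.predict(X_te), average="macro")
print(f"UAR: {uar:.3f}")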

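The Fisher-score feature selection mentioned in the abstract can also be stated concretely. Below is a small, self-contained implementation of the standard per-feature Fisher score (between-class scatter divided by within-class scatter); this is the generic textbook formulation, not necessarily the exact variant used in the paper, and the example names are hypothetical.

import numpy as np

def fisher_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Per-feature Fisher score: sum_k n_k*(mu_k - mu)^2 / sum_k n_k*var_k.

    Higher scores mark features whose per-class means are far apart
    relative to their within-class variance.
    """
    mu = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        between += len(Xk) * (Xk.mean(axis=0) - mu) ** 2
        within += len(Xk) * Xk.var(axis=0)
    return between / np.maximum(within, 1e-12)

# Example: keep the 200 highest-scoring features for a binary
# cry-vs-fuss discrimination task.
# scores = fisher_scores(X, y)
# top200 = np.argsort(scores)[::-1][:200]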

Figures (PMC full text):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ba4/9435967/849a22cc92ad/nihms-1777449-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ba4/9435967/19cddf8c7e36/nihms-1777449-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ba4/9435967/e7284f937291/nihms-1777449-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ba4/9435967/4c0267774880/nihms-1777449-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ba4/9435967/c577412a29fc/nihms-1777449-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ba4/9435967/9780b8e3fb60/nihms-1777449-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ba4/9435967/593feeb063da/nihms-1777449-f0007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ba4/9435967/ab60dee91eb6/nihms-1777449-f0008.jpg

Similar Articles

1. Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations.
   Speech Commun. 2021 Oct;133:41-61. doi: 10.1016/j.specom.2021.07.010. Epub 2021 Aug 18.
2. Speech emotion recognition using machine learning techniques: Feature extraction and comparison of convolutional neural network and random forest.
   PLoS One. 2023 Nov 21;18(11):e0291500. doi: 10.1371/journal.pone.0291500. eCollection 2023.
3. Effect on speech emotion classification of a feature selection approach using a convolutional neural network.
   PeerJ Comput Sci. 2021 Nov 3;7:e766. doi: 10.7717/peerj-cs.766. eCollection 2021.
4. Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network.
   Sensors (Basel). 2020 Oct 23;20(21):6008. doi: 10.3390/s20216008.
5. DCNN for Pig Vocalization and Non-Vocalization Classification: Evaluate Model Robustness with New Data.
   Animals (Basel). 2024 Jul 9;14(14):2029. doi: 10.3390/ani14142029.
6. Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition.
   Sensors (Basel). 2020 Sep 28;20(19):5559. doi: 10.3390/s20195559.
7. Infant-adult vocal interaction dynamics depend on infant vocal type, child-directedness of adult speech, and timeframe.
   Infant Behav Dev. 2019 Nov;57:101325. doi: 10.1016/j.infbeh.2019.04.007. Epub 2019 May 14.
8. Data-driven automated acoustic analysis of human infant vocalizations using neural network tools.
   J Acoust Soc Am. 2010 Apr;127(4):2563-77. doi: 10.1121/1.3327460.
9. Is infant-directed speech interesting because it is surprising? - Linking properties of IDS to statistical learning and attention at the prosodic level.
   Cognition. 2018 Sep;178:193-206. doi: 10.1016/j.cognition.2018.05.015. Epub 2018 Jun 6.
10. Classification of Infant Cry Based on Hybrid Audio Features and ResLSTM.
    J Voice. 2024 Sep 20. doi: 10.1016/j.jvoice.2024.08.022.

Cited By

1. Preliminary Technical Validation of LittleBeats™: A Multimodal Sensing Platform to Capture Cardiac Physiology, Motion, and Vocalizations.
   Sensors (Basel). 2024 Jan 30;24(3):901. doi: 10.3390/s24030901.
2. Evaluating Users' Experiences of a Child Multimodal Wearable Device: Mixed Methods Approach.
   JMIR Hum Factors. 2024 Feb 8;11:e49316. doi: 10.2196/49316.
3. Emerging Verbal Functions in Early Infancy: Lessons from Observational and Computational Approaches on Typical Development and Neurodevelopmental Disorders.
   Adv Neurodev Disord. 2022 Dec;6(4):369-388. doi: 10.1007/s41252-022-00300-7. Epub 2022 Oct 25.
4. A Multistage Heterogeneous Stacking Ensemble Model for Augmented Infant Cry Classification.
   Front Public Health. 2022 Mar 24;10:819865. doi: 10.3389/fpubh.2022.819865. eCollection 2022.
References

1. Infant-Directed Speech Facilitates Word Segmentation.
   Infancy. 2005 Jan;7(1):53-71. doi: 10.1207/s15327078in0701_5. Epub 2005 Jan 1.
2. Infant-adult vocal interaction dynamics depend on infant vocal type, child-directedness of adult speech, and timeframe.
   Infant Behav Dev. 2019 Nov;57:101325. doi: 10.1016/j.infbeh.2019.04.007. Epub 2019 May 14.
3. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English.
   PLoS One. 2018 May 16;13(5):e0196391. doi: 10.1371/journal.pone.0196391. eCollection 2018.
4. Infant-directed speech from seven to nineteen months has similar acoustic properties but different functions.
   J Child Lang. 2018 Sep;45(5):1035-1053. doi: 10.1017/S0305000917000629. Epub 2018 Mar 5.
5. A Novel Way to Measure and Predict Development: A Heuristic Approach to Facilitate the Early Detection of Neurodevelopmental Disorders.
   Curr Neurol Neurosci Rep. 2017 May;17(5):43. doi: 10.1007/s11910-017-0748-8.
6. Automated analysis of child phonetic production using naturalistic recordings.
   J Speech Lang Hear Res. 2014 Oct;57(5):1638-50. doi: 10.1044/2014_JSLHR-S-13-0037.
7. A flexible analysis tool for the quantitative acoustic assessment of infant cry.
   J Speech Lang Hear Res. 2013 Oct;56(5):1416-28. doi: 10.1044/1092-4388(2013/11-0298). Epub 2013 Jun 19.
8. Discrimination between mothers' infant- and adult-directed speech using hidden Markov models.
   Neurosci Res. 2011 May;70(1):62-70. doi: 10.1016/j.neures.2011.01.010. Epub 2011 Jan 21.
9. Data-driven automated acoustic analysis of human infant vocalizations using neural network tools.
   J Acoust Soc Am. 2010 Apr;127(4):2563-77. doi: 10.1121/1.3327460.
10. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy.
    IEEE Trans Pattern Anal Mach Intell. 2005 Aug;27(8):1226-38. doi: 10.1109/TPAMI.2005.159.