O'Donovan Rebecca, Sezgin Emre, Bambach Sven, Butter Eric, Lin Simon
The Abigail Wexner Research Institute, Nationwide Children's Hospital, Columbus, OH, United States.
Department of Psychology, Nationwide Children's Hospital, Columbus, OH, United States.
JMIR Form Res. 2020 Jun 16;4(6):e18279. doi: 10.2196/18279.
Qualitative self- or parent-reports used in assessing children's behavioral disorders are often inconvenient to collect and can be misleading due to missing information, rater biases, and limited validity. A data-driven approach to quantifying behavioral disorders could alleviate these concerns. This study proposes a machine learning approach to identifying screams in voice recordings, one that avoids the need to gather large amounts of clinical data for model training.
The goal of this study is to evaluate whether a machine learning model trained only on publicly available audio data sets can detect screaming sounds in audio streams captured in an at-home setting.
Two sets of audio samples were prepared to evaluate the model: a subset of the publicly available AudioSet data set and a set of audio data extracted from the TV show Supernanny, which was chosen for its similarity to clinical data. Scream events were manually annotated for the Supernanny data, and existing annotations were refined for the AudioSet data. Audio feature extraction was performed with a convolutional neural network pretrained on AudioSet. A gradient-boosted tree model was trained and cross-validated for scream classification on the AudioSet data and then validated independently on the Supernanny audio.
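The classification stage described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes 128-dimensional audio embeddings (the typical output of an AudioSet-pretrained CNN feature extractor) and substitutes synthetic arrays for the real embeddings and scream labels; the hyperparameters are arbitrary.

```python
# Sketch of the scream-classification stage: a gradient-boosted tree
# model cross-validated on audio embeddings. The embeddings and labels
# below are synthetic stand-ins; in the study they would come from an
# AudioSet-pretrained CNN and the manual annotations, respectively.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_clips, embed_dim = 400, 128
X = rng.normal(size=(n_clips, embed_dim))      # placeholder embeddings
y = rng.integers(0, 2, size=n_clips)           # placeholder scream/no-scream labels

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
# Cross-validated ROC-AUC, as reported in the study
cv_auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(cv_auc.mean())
```

With random labels the cross-validated AUC hovers near chance (0.5); the point of the sketch is only the pipeline shape, not the score.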
On the held-out AudioSet clips, the model achieved an area under the receiver operating characteristic curve (ROC-AUC) of 0.86. The same model applied to three full episodes of Supernanny audio achieved an ROC-AUC of 0.95 and an average precision (positive predictive value) of 42%, despite screams making up only 1.3% (n=92/7166 seconds) of the total run time.
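The two reported metrics can be computed with standard library calls. The snippet below is illustrative only: the labels and scores are synthetic, constructed to mimic the ~1.3% scream prevalence of the Supernanny audio, and the resulting numbers are not the study's results.

```python
# Computing ROC-AUC and average precision on an imbalanced label set,
# mimicking the ~1.3% scream prevalence reported for the Supernanny audio.
# Labels and scores here are synthetic, not the study's data.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)
n_seconds = 7166                                     # total run time in seconds
y_true = (rng.random(n_seconds) < 0.013).astype(int) # ~1.3% positive seconds
# Synthetic classifier scores that partially separate the two classes
y_score = rng.normal(loc=1.5 * y_true, scale=1.0)

auc = roc_auc_score(y_true, y_score)
ap = average_precision_score(y_true, y_score)
print(auc, ap)
```

Average precision is the more informative metric here: with only 1.3% positives, a random ranker would score an average precision near 0.013, so the study's 42% represents a large lift over chance even though ROC-AUC can look optimistic under heavy class imbalance.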
These results suggest that a scream-detection model trained on publicly available data could be valuable for monitoring clinical recordings and identifying tantrums, avoiding the need to collect costly, privacy-protected clinical data for model training.