Kothinti Sandeep Reddy, Elhilali Mounya
Department of Electrical and Computer Engineering, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD, United States.
Front Psychol. 2023 Nov 30;14:1276237. doi: 10.3389/fpsyg.2023.1276237. eCollection 2023.
Auditory salience is a fundamental property of a sound that allows it to grab a listener's attention regardless of their attentional state or behavioral goals. While previous research has shed light on acoustic factors influencing auditory salience, the semantic dimensions of this phenomenon have remained relatively unexplored owing both to the complexity of measuring salience in audition and to the limited focus on complex natural scenes. In this study, we examine the relationship between acoustic, contextual, and semantic attributes and their impact on the auditory salience of natural audio scenes using a dichotic listening paradigm. The experiments present acoustic scenes in forward and backward directions; the latter diminishes semantic effects, providing a counterpoint to the effects observed in forward scenes. The behavioral data collected from a crowd-sourced platform reveal a striking convergence in temporal salience maps for certain sound events, while marked disparities emerge in others. Our main hypothesis posits that differences in the perceptual salience of events are predominantly driven by semantic and contextual cues, which is particularly evident in cases displaying substantial disparities between forward and backward presentations. Conversely, events exhibiting a high degree of alignment can largely be attributed to low-level acoustic attributes. To evaluate this hypothesis, we employ analytical techniques that combine rich low-level mappings from acoustic profiles with high-level embeddings extracted from a deep neural network. This integrated approach captures both acoustic and semantic attributes of acoustic scenes along with their temporal trajectories. The results demonstrate that perceptual salience is a careful interplay between low-level and high-level attributes that shapes which moments stand out in a natural soundscape.
Furthermore, our findings underscore the important role of longer-term context as a critical component of auditory salience, enabling us to discern and adapt to temporal regularities within an acoustic scene. The experimental and model-based validation of semantic factors of salience paves the way for a complete understanding of auditory salience. Ultimately, the empirical and computational analyses have implications for developing large-scale models for auditory salience and audio analytics.