Department of Health Sciences & Technology, Gachon Advanced Institute for Health Sciences & Technology (GAIHST), Gachon University, Incheon 21936, Korea.
School of Information, University of California, 102 South Hall 4600, Berkeley, CA 94720, USA.
Sensors (Basel). 2022 Jun 20;22(12):4650. doi: 10.3390/s22124650.
Voice-activated artificial intelligence (AI) technology has advanced rapidly and is being adopted in various devices, such as smart speakers and display products, which enable users to multitask without touching the devices. However, most devices equipped with cameras and displays lack mobility; therefore, users cannot avoid touching them for face-to-face interactions, which contradicts the voice-activated AI philosophy. In this paper, we propose a deep neural network-based real-time sound source localization (SSL) model for low-power Internet of Things (IoT) devices based on microphone arrays and present a prototype implemented on actual IoT devices. The proposed SSL model delivers multi-channel acoustic data to parallel convolutional neural network layers in the form of multiple streams to capture the unique delay patterns for the low-, mid-, and high-frequency ranges, and estimates the fine and coarse locations of voices. The model adapted in this study achieved an accuracy of 91.41% on fine location estimation and a direction of arrival error of 7.43° on noisy data. It achieved a processing time of 7.811 ms per 40 ms sample on the Raspberry Pi 4B. The proposed model can be applied to a camera-based humanoid robot that mimics the manner in which humans react to trigger voices in crowded environments.
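The real-time claim can be checked with simple arithmetic on the two reported figures. The sketch below is illustrative only: the function name `real_time_factor` is hypothetical, and it assumes that frames are processed one after another without overlap and that the reported 7.811 ms covers the full per-frame pipeline, neither of which the abstract states.

```python
def real_time_factor(processing_ms: float, frame_ms: float) -> float:
    """Ratio of compute time to audio duration; a value below 1.0 keeps up with the stream."""
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    return processing_ms / frame_ms


# Figures reported in the abstract for the Raspberry Pi 4B.
PROCESSING_MS = 7.811  # processing time per frame
FRAME_MS = 40.0        # audio duration of one frame

# Assumption: sequential, non-overlapping frames (not stated in the abstract).
rtf = real_time_factor(PROCESSING_MS, FRAME_MS)
headroom_ms = FRAME_MS - PROCESSING_MS

print(f"real-time factor: {rtf:.3f}")
print(f"headroom per frame: {headroom_ms:.3f} ms")
```

Under these assumptions, inference occupies roughly one fifth of each frame period, leaving about 32 ms per frame for audio capture and any downstream control logic.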