Department of Computer Science and Engineering, Tatung University, Taipei City 104, Taiwan.
Department of Information Management, National Central University, Taoyuan City 320, Taiwan.
Sensors (Basel). 2023 Jul 8;23(14):6250. doi: 10.3390/s23146250.
In recent years, many activities have moved to video conferencing because of the worldwide COVID-19 pandemic. A video conference typically relies on a webcam used together with a computer and an Internet connection. However, an ordinary webcam cannot rotate automatically or keep its view locked on the current speaker. This study therefore uses the object detector YOLO to detect the upper body of every person in the frame and to judge whether each person's mouth is open or closed. At the same time, the Time Difference of Arrival (TDOA) measured by a microphone array is used to estimate the angle of the sound source. Each person's position in the image, as obtained by YOLO, is then converted back to a position in spatial coordinates using the distance between the person and the camera, and the angle between the person and the camera is calculated from those spatial coordinates with inverse trigonometric functions. Finally, the angle obtained from the camera is matched against the sound-source angle obtained from the microphone array to locate the speaker. The experimental results show that positioning with YOLOX-Tiny alone reached a recall of 85.2%, and positioning with TDOA alone reached a recall of 88%. Integrating YOLOX-Tiny and TDOA gave a recall of 86.7%, a precision of 100%, and an accuracy of 94.5%. The authors conclude that the proposed method can locate the speaker and performs better than using only one source.
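The angle-matching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a far-field two-microphone TDOA model, a pinhole camera with a known horizontal field of view, and a fixed matching tolerance, none of which are specified in the abstract. The names `Detection`, `tdoa_angle_deg`, `camera_angle_deg`, `match_speaker`, and the 10-degree default tolerance are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 °C


@dataclass
class Detection:
    """One detected person (hypothetical structure, not taken from the paper)."""
    x_center_px: float  # horizontal centre of the upper-body bounding box
    distance_m: float   # estimated distance between the person and the camera
    mouth_open: bool    # mouth state reported by the detector


def tdoa_angle_deg(delay_s: float, mic_spacing_m: float) -> float:
    """Angle of arrival for one microphone pair; 0 means straight ahead.

    Assumes a far-field source, so sin(theta) = c * delay / spacing.
    """
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp so noise cannot break asin
    return math.degrees(math.asin(ratio))


def camera_angle_deg(det: Detection, image_width_px: int, hfov_deg: float) -> float:
    """Horizontal angle between a person and the camera's optical axis.

    Assumes a pinhole camera: the pixel offset is scaled by the distance to
    get a lateral offset in metres, then atan2 recovers the angle.
    """
    focal_px = (image_width_px / 2) / math.tan(math.radians(hfov_deg) / 2)
    lateral_m = (det.x_center_px - image_width_px / 2) * det.distance_m / focal_px
    return math.degrees(math.atan2(lateral_m, det.distance_m))


def match_speaker(
    detections: Sequence[Detection],
    sound_angle_deg: float,
    image_width_px: int,
    hfov_deg: float,
    tolerance_deg: float = 10.0,  # assumption: illustrative threshold only
) -> Optional[int]:
    """Index of the open-mouthed person closest in angle to the sound source.

    Returns None when no open-mouthed person lies within the tolerance.
    """
    best_index, best_error = None, tolerance_deg
    for i, det in enumerate(detections):
        if not det.mouth_open:
            continue  # a person with a closed mouth is not a speaker candidate
        error = abs(camera_angle_deg(det, image_width_px, hfov_deg) - sound_angle_deg)
        if error <= best_error:
            best_index, best_error = i, error
    return best_index
```

In this sketch, a sound-source angle that falls near an open-mouthed person selects that person, while an angle that only lines up with a closed-mouth person yields `None`. Requiring agreement between the two sources in this way is one plausible reading of why the combined method can trade some recall for higher precision.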