Pande Akshara, Mishra Deepti
Educational Technology Laboratory, Intelligent System and Analytics Group, Department of Computer Science (IDI), Norwegian University of Science and Technology, 2815 Gjøvik, Norway.
Biomimetics (Basel). 2024 Jun 27;9(7):391. doi: 10.3390/biomimetics9070391.
Speech comprehension can be challenging for multiple reasons, causing inconvenience for both the speaker and the listener. In such situations, the humanoid robot Pepper can help by displaying the corresponding text on its screen. Before that is possible, however, the accuracy of the audio recordings captured by Pepper must be carefully assessed. In this study, an experiment was conducted with eight participants, with the primary objective of examining Pepper's speech recognition system using audio features such as Mel-frequency cepstral coefficients (MFCCs), spectral centroid, spectral flatness, zero-crossing rate, pitch, and energy. The K-means algorithm was then employed to create clusters based on these features, and the most suitable cluster was selected with the help of the speech-to-text tool Whisper. The best cluster is the one containing the most high-accuracy data points; to identify it, data points with a word error rate (WER) above 0.3 were discarded. The findings suggest that a distance of up to one meter from Pepper is suitable for capturing the best speech recordings, whereas age and gender do not influence the accuracy of the recorded speech. The proposed system would be a significant asset in settings where subtitles are needed to improve the comprehension of spoken statements.
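The cluster-selection criterion described above can be illustrated with a minimal sketch. It assumes that WER is the standard word-level edit distance divided by the number of reference words, and that the "best" cluster is the one retaining the most recordings once those with WER above 0.3 are discarded. The abstract does not spell out the exact procedure, so the function names `wer` and `best_cluster`, the lowercase whitespace tokenization, and the tie-breaking behavior are all hypothetical. Cluster labels are taken as given (for example, from a K-means run on the audio features).

```python
from collections import Counter


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words.

    Assumption: simple lowercase whitespace tokenization, no punctuation handling.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def best_cluster(labels, wers, threshold=0.3):
    """Return the cluster label with the most recordings whose WER <= threshold.

    Assumption: recordings with WER above the threshold are discarded before
    counting; ties go to the cluster encountered first. Returns None if every
    recording is discarded.
    """
    kept = Counter(label for label, w in zip(labels, wers) if w <= threshold)
    if not kept:
        return None
    return kept.most_common(1)[0][0]


# Illustrative values only (not data from the study).
labels = [0, 0, 1, 1, 1]
wers = [0.1, 0.5, 0.0, 0.2, 0.4]
print(wer("the robot shows text", "the robot show text"))  # one substitution in four words
print(best_cluster(labels, wers))
```

Counting retained recordings per cluster is only one reading of "maximum accuracy data points lying in a cluster"; a proportion-based criterion (the share of each cluster that survives the threshold) would be an equally plausible alternative.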