Pan Ruiying
The College of Henan Procuratorial Profession, Zhengzhou, China.
Front Neurorobot. 2024 Nov 15;18:1478181. doi: 10.3389/fnbot.2024.1478181. eCollection 2024.
Speech recognition and multimodal learning are two critical areas in machine learning. Current multimodal speech recognition systems often encounter challenges such as high computational demands and model complexity.
To overcome these issues, we propose a novel framework, EnglishAL-Net, a Multimodal Fusion-powered English Speaking Robot. The framework builds on the ALBEF model, optimizing it for real-time speech and multimodal interaction, and incorporates a newly designed text-and-image editor to fuse visual and textual information. The robot processes dynamic spoken input through integrated Neural Machine Translation (NMT), enhancing its ability to understand and respond to spoken language.
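The ALBEF-style fusion of visual and textual information described above can be illustrated with a minimal single-head cross-attention sketch. This is a simplified, hypothetical illustration in NumPy (the dimensions, function names, and residual-fusion choice are assumptions for exposition; the actual model uses learned projections and stacked transformer layers):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(text_feats, image_feats):
    """Fuse text tokens with image patches via single-head cross-attention.

    text_feats:  (T, d) text token embeddings, used as queries
    image_feats: (P, d) image patch embeddings, used as keys/values
    returns:     (T, d) text embeddings enriched with visual context

    Hypothetical simplification: real ALBEF-style fusion uses learned
    query/key/value projections inside multi-layer transformer blocks.
    """
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)   # (T, P) similarity
    attn = softmax(scores, axis=-1)                    # rows sum to 1
    return text_feats + attn @ image_feats             # residual fusion

# Toy example with random embeddings (4 text tokens, 16 image patches).
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
image = rng.standard_normal((16, 8))
fused = cross_attention_fuse(text, image)
```

Each text token attends over all image patches, so the fused representation carries visual context into the language stream before downstream decoding.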
In the experimental section, we constructed a dataset containing various scenarios and oral instructions for testing. The results show that, compared with traditional unimodal processing methods, our model significantly improves language-understanding accuracy and reduces response time. This research not only enhances the performance of multimodal interaction in robots but also opens up new possibilities for applications of robotic technology in education, rescue, customer service, and other fields, holding significant theoretical and practical value.