

Multimodal fusion-powered English speaking robot.

Author

Pan Ruiying

Affiliation

The College of Henan Procuratorial Profession, Zhengzhou, China.

Publication

Front Neurorobot. 2024 Nov 15;18:1478181. doi: 10.3389/fnbot.2024.1478181. eCollection 2024.

DOI: 10.3389/fnbot.2024.1478181
PMID: 39618808
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11604748/
Abstract

INTRODUCTION

Speech recognition and multimodal learning are two critical areas in machine learning. Current multimodal speech recognition systems often encounter challenges such as high computational demands and model complexity.

METHODS

To overcome these issues, we propose a novel framework, EnglishAL-Net, a Multimodal Fusion-powered English Speaking Robot. This framework leverages the ALBEF model, optimizing it for real-time speech and multimodal interaction, and incorporates a newly designed text and image editor to fuse visual and textual information. The robot processes dynamic spoken input through the integration of Neural Machine Translation (NMT), enhancing its ability to understand and respond to spoken language.
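The abstract does not include implementation details. As a rough, hypothetical sketch of the kind of ALBEF-style cross-attention fusion it describes, where text token embeddings attend over image patch embeddings, consider the following (all shapes, names, and the residual-fusion choice are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(text_emb, image_emb):
    """Fuse text tokens with image patches via single-head cross-attention.

    text_emb:  (T, d) text token embeddings (queries)
    image_emb: (P, d) image patch embeddings (keys/values)
    Returns a (T, d) array of text embeddings enriched with visual context.
    """
    d = text_emb.shape[-1]
    scores = text_emb @ image_emb.T / np.sqrt(d)   # (T, P) scaled similarities
    attn = softmax(scores, axis=-1)                # each text token attends over patches
    visual_context = attn @ image_emb              # (T, d) attended visual features
    return text_emb + visual_context               # residual fusion

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 32))    # 5 hypothetical text tokens
image = rng.standard_normal((9, 32))   # 9 hypothetical image patches
fused = cross_attention_fuse(text, image)
print(fused.shape)  # (5, 32)
```

In practice, ALBEF-style models stack several such cross-attention layers with learned query/key/value projections; the single matrix-product version above only shows the core fusion step.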

RESULTS AND DISCUSSION

In the experimental section, we constructed a dataset containing various scenarios and oral instructions for testing. The results show that compared to traditional unimodal processing methods, our model significantly improves both language understanding accuracy and response time. This research not only enhances the performance of multimodal interaction in robots but also opens up new possibilities for applications of robotic technology in education, rescue, customer service, and other fields, holding significant theoretical and practical value.


Article figures (g0001–g0010):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e028/11604748/a71eed8e97c8/fnbot-18-1478181-g0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e028/11604748/f1bed58bd9f3/fnbot-18-1478181-g0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e028/11604748/cbe62dff70f7/fnbot-18-1478181-g0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e028/11604748/e1a8929bd76f/fnbot-18-1478181-g0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e028/11604748/5a19a4f4cba7/fnbot-18-1478181-g0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e028/11604748/2976c2e0f095/fnbot-18-1478181-g0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e028/11604748/0c14bd9a5ee4/fnbot-18-1478181-g0007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e028/11604748/637fb1abf96f/fnbot-18-1478181-g0008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e028/11604748/66c0cf8e0319/fnbot-18-1478181-g0009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e028/11604748/df2eda2a92ce/fnbot-18-1478181-g0010.jpg

Similar articles

1. Multimodal fusion-powered English speaking robot.
Front Neurorobot. 2024 Nov 15;18:1478181. doi: 10.3389/fnbot.2024.1478181. eCollection 2024.
2. A multimodal educational robots driven via dynamic attention.
Front Neurorobot. 2024 Oct 31;18:1453061. doi: 10.3389/fnbot.2024.1453061. eCollection 2024.
3. Multimodal robot-assisted English writing guidance and error correction with reinforcement learning.
Front Neurorobot. 2024 Nov 20;18:1483131. doi: 10.3389/fnbot.2024.1483131. eCollection 2024.
4. MusicARLtrans Net: a multimodal agent interactive music education system driven via reinforcement learning.
Front Neurorobot. 2024 Nov 21;18:1479694. doi: 10.3389/fnbot.2024.1479694. eCollection 2024.
5. Multimodal machine learning for language and speech markers identification in mental health.
BMC Med Inform Decis Mak. 2024 Nov 22;24(1):354. doi: 10.1186/s12911-024-02772-0.
6. MMAgentRec, a personalized multi-modal recommendation agent with large language model.
Sci Rep. 2025 Apr 8;15(1):12062. doi: 10.1038/s41598-025-96458-w.
7. A multimodal human-robot sign language interaction framework applied in social robots.
Front Neurosci. 2023 Apr 11;17:1168888. doi: 10.3389/fnins.2023.1168888. eCollection 2023.
8. Remote sensing traffic scene retrieval based on learning control algorithm for robot multimodal sensing information fusion and human-machine interaction and collaboration.
Front Neurorobot. 2023 Oct 11;17:1267231. doi: 10.3389/fnbot.2023.1267231. eCollection 2023.
9. Multi-dimensional fusion: transformer and GANs-based multimodal audiovisual perception robot for musical performance art.
Front Neurorobot. 2023 Sep 29;17:1281944. doi: 10.3389/fnbot.2023.1281944. eCollection 2023.
10. Multimodal learning-based speech enhancement and separation, recent innovations, new horizons, challenges and real-world applications.
Comput Biol Med. 2025 May;190:110082. doi: 10.1016/j.compbiomed.2025.110082. Epub 2025 Apr 1.

Cited by

1. Interdisciplinary approaches to image processing for medical robotics.
Front Med (Lausanne). 2025 Jun 2;12:1564678. doi: 10.3389/fmed.2025.1564678. eCollection 2025.
2. Bridging language gaps: The role of NLP and speech recognition in oral english instruction.
MethodsX. 2025 May 7;14:103359. doi: 10.1016/j.mex.2025.103359. eCollection 2025 Jun.
3. Cross-attention swin-transformer for detailed segmentation of ancient architectural color patterns.
Front Neurorobot. 2024 Dec 13;18:1513488. doi: 10.3389/fnbot.2024.1513488. eCollection 2024.

References

1. Intuitive and versatile bionic legs: a perspective on volitional control.
Front Neurorobot. 2024 Jun 20;18:1410760. doi: 10.3389/fnbot.2024.1410760. eCollection 2024.
2. Multi-granularity contrastive learning model for next POI recommendation.
Front Neurorobot. 2024 Jun 14;18:1428785. doi: 10.3389/fnbot.2024.1428785. eCollection 2024.
3. Understanding older people's voice interactions with smart voice assistants: a new modified rule-based natural language processing model with human input.
Front Digit Health. 2024 May 14;6:1329910. doi: 10.3389/fdgth.2024.1329910. eCollection 2024.
4. Multimodal Deep Reinforcement Learning with Auxiliary Task for Obstacle Avoidance of Indoor Mobile Robot.
Sensors (Basel). 2021 Feb 15;21(4):1363. doi: 10.3390/s21041363.
5. A Multimodal Emotional Human-Robot Interaction Architecture for Social Robots Engaged in Bidirectional Communication.
IEEE Trans Cybern. 2021 Dec;51(12):5954-5968. doi: 10.1109/TCYB.2020.2974688. Epub 2021 Dec 22.
6. Deep Multimodal Learning for Emotion Recognition in Spoken Language.
Proc IEEE Int Conf Acoust Speech Signal Process. 2018 Apr;2018:5079-5083. doi: 10.1109/ICASSP.2018.8462440. Epub 2018 Sep 13.
7. Mental status assessment of disaster relief personnel by vocal affect display based on voice emotion recognition.
Disaster Mil Med. 2017 Apr 8;3:4. doi: 10.1186/s40696-017-0032-0. eCollection 2017.