Jianliang An
College of Education, Hebei Normal University, Hebei, China.
Industrial Fundamentals Teaching Department, Hebei Vocational University of Industry and Technology, Shijiazhuang, Hebei, China.
Front Neurorobot. 2024 Oct 31;18:1453061. doi: 10.3389/fnbot.2024.1453061. eCollection 2024.
As artificial intelligence and robotics advance, educational robots are seeing increasingly widespread use in teaching. However, effectively evaluating and optimizing multimodal educational robots remains a challenge.
This study introduces Res-ALBEF, a multimodal educational robot framework driven by dynamic attention. Res-ALBEF enhances the ALBEF (Align Before Fuse) method by incorporating residual connections to align visual and textual data more effectively before fusion. In addition, the model integrates a VGG19-based convolutional network for image feature extraction and employs a dynamic attention mechanism to focus on the relevant parts of the multimodal inputs. The model was trained on a diverse dataset of 50,000 multimodal educational instances covering a variety of subjects and instructional content.
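The abstract does not give implementation details, but the fusion idea it describes (input-dependent attention weights over the two modalities, plus a residual connection that preserves the pre-fusion features) can be sketched as follows. All function and variable names here are illustrative assumptions, not the authors' code, and a simple gating score stands in for the paper's full dynamic attention mechanism:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_attention_fuse(img_feats, txt_feats):
    """Fuse aligned image and text feature vectors of equal dimension.

    A scalar score per modality is computed from the inputs themselves,
    so the fusion weights change with the content ("dynamic" attention).
    A residual connection adds the pre-fusion features back, in the
    spirit of the residual links Res-ALBEF adds to ALBEF.
    """
    # Input-dependent gating scores (a crude stand-in for learned attention)
    scores = np.array([img_feats.mean(), txt_feats.mean()])
    w = softmax(scores)  # weights sum to 1

    # Attention-weighted combination of the two modalities
    fused = w[0] * img_feats + w[1] * txt_feats

    # Residual connection: preserve pre-fusion information
    return fused + img_feats + txt_feats
```

In a real model the gating scores would be produced by learned projections (e.g. query-key dot products) rather than feature means, and the image features would come from the VGG19 backbone the paper mentions.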
Evaluation on an independent validation set of 10,000 samples demonstrated significant performance improvements: the model achieved an overall accuracy of 97.38% in educational content recognition. These results highlight the model's ability to improve the alignment and fusion of multimodal information, making it a robust solution for multimodal educational robots.