Li Yifan, Jiang Peiyang, Chai Chengpeng, Zhang Xuyang, Liu Chengguo
Faculty of Engineering, University of Bristol, Bristol, BS8 1QU, UK.
School of Engineering, Westlake University, Hangzhou, 310030, China.
Sci Rep. 2025 Aug 24;15(1):31141. doi: 10.1038/s41598-025-17015-z.
Humans tackle unknown tasks by integrating information from multiple sensory modalities. Existing robotic frameworks struggle to achieve effective multimodal manipulation, especially when sufficient training data is lacking. This study introduces "Panda Act", a novel robotic manipulation mechanism that leverages large language models (LLMs) and multimodal zero-shot models. The manipulation strategies are generated by LLMs as Python code, which dynamically orchestrates a suite of zero-shot visual and auditory models to fulfil task requirements. This enables robots to execute multimodal manipulations without requiring additional training. Extensive experiments in both simulated and real-world environments demonstrate that this approach excels in task comprehension, zero-shot execution, and adaptability, opening new avenues for enhancing robot adaptability in uncertain environments.
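The abstract describes strategies that the LLM emits as Python code, which then calls zero-shot visual and auditory models at run time. The following is a minimal sketch of that orchestration pattern only; every name in it (query_llm, detect, locate_sound, pick, place) is an illustrative assumption, not the paper's actual API.

```python
# Sketch of LLM-generated code orchestrating zero-shot perception models.
# All primitives below are hypothetical stubs for illustration.

def detect(description: str) -> tuple[float, float, float]:
    """Zero-shot visual grounding stub: returns an object's 3D position."""
    return (0.4, 0.1, 0.02)  # placeholder coordinates

def locate_sound(description: str) -> tuple[float, float, float]:
    """Zero-shot auditory localisation stub: returns a sound source position."""
    return (0.2, -0.3, 0.05)  # placeholder coordinates

def pick(position) -> None:
    print(f"pick at {position}")

def place(position) -> None:
    print(f"place at {position}")

def query_llm(task: str) -> str:
    """Stand-in for an LLM call that returns a manipulation plan as Python code."""
    return (
        "source = locate_sound('ringing')\n"
        "target = detect('the phone stand')\n"
        "pick(source)\n"
        "place(target)\n"
    )

def run_task(task: str) -> None:
    plan = query_llm(task)  # the LLM writes the strategy as executable code
    primitives = {
        "detect": detect,
        "locate_sound": locate_sound,
        "pick": pick,
        "place": place,
    }
    exec(plan, primitives)  # the code dynamically invokes the zero-shot models

if __name__ == "__main__":
    run_task("Pick up the object that is ringing and place it on the phone stand")
```

Because no task-specific model is trained, adapting to a new instruction amounts to generating a new plan over the same perception and control primitives, which is the zero-shot property the abstract claims.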