School of Automation, Jiangsu University of Science and Technology, No. 666 Changhui Road, Zhenjiang 212100, China.
Systems Science Laboratory, Jiangsu University of Science and Technology, No. 666 Changhui Road, Zhenjiang 212100, China.
Sensors (Basel). 2023 Mar 22;23(6):3340. doi: 10.3390/s23063340.
In the field of vision-based robot grasping, effectively leveraging RGB and depth information to accurately determine the position and pose of a target is a critical issue. To address this challenge, we propose a tri-stream cross-modal fusion architecture for 2-DoF visual grasp detection. The architecture facilitates the bilateral exchange of RGB and depth information and is designed to efficiently aggregate multiscale information. A novel modal interaction module (MIM) with a spatial-wise cross-attention algorithm adaptively captures cross-modal feature information, while channel interaction modules (CIM) further enhance the aggregation of the different modal streams. In addition, global multiscale information is efficiently aggregated through a hierarchical structure with skip connections. To evaluate the performance of the proposed method, we conducted validation experiments on standard public datasets as well as real robot grasping experiments. We achieved image-wise detection accuracies of 99.4% and 96.7% on the Cornell and Jacquard datasets, respectively, and object-wise detection accuracies of 97.8% and 94.6% on the same datasets. Furthermore, physical experiments using a 6-DoF Elite robot demonstrated a grasp success rate of 94.5%. These results highlight the superior accuracy of the proposed method.
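The paper does not publish its implementation; as a rough illustration of how a spatial-wise cross-attention interaction between the RGB and depth streams could be structured, the following PyTorch sketch lets each modality attend over the other's spatial feature map and returns both enriched streams. All names (SpatialCrossAttentionMIM, reduction, etc.) are hypothetical and do not come from the paper, and the actual MIM/CIM designs may differ.

```python
# Minimal sketch of a hypothetical spatial-wise cross-attention
# "modal interaction module" (MIM) for RGB/depth feature fusion.
# Assumption: both streams produce feature maps of identical shape (B, C, H, W).
import torch
import torch.nn as nn


class SpatialCrossAttentionMIM(nn.Module):
    """Each modality queries the other's spatial positions and adds the result residually."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = max(channels // reduction, 1)
        # Queries from one stream, keys/values from the other stream.
        self.q_rgb = nn.Conv2d(channels, inner, kernel_size=1)
        self.k_depth = nn.Conv2d(channels, inner, kernel_size=1)
        self.v_depth = nn.Conv2d(channels, channels, kernel_size=1)
        self.q_depth = nn.Conv2d(channels, inner, kernel_size=1)
        self.k_rgb = nn.Conv2d(channels, inner, kernel_size=1)
        self.v_rgb = nn.Conv2d(channels, channels, kernel_size=1)

    def _attend(self, q, k, v):
        # Flatten spatial dimensions and compute attention over all H*W positions.
        b, c, h, w = q.shape
        q = q.flatten(2).transpose(1, 2)                 # (B, HW, C')
        k = k.flatten(2)                                 # (B, C', HW)
        v = v.flatten(2).transpose(1, 2)                 # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)   # (B, HW, HW)
        return (attn @ v).transpose(1, 2).reshape(b, -1, h, w)

    def forward(self, rgb_feat, depth_feat):
        # RGB stream enriched by depth features, and vice versa (residual connections).
        rgb_out = rgb_feat + self._attend(
            self.q_rgb(rgb_feat), self.k_depth(depth_feat), self.v_depth(depth_feat))
        depth_out = depth_feat + self._attend(
            self.q_depth(depth_feat), self.k_rgb(rgb_feat), self.v_rgb(rgb_feat))
        return rgb_out, depth_out


if __name__ == "__main__":
    # Toy usage on small feature maps to verify shapes only.
    mim = SpatialCrossAttentionMIM(channels=32)
    rgb = torch.randn(2, 32, 28, 28)
    depth = torch.randn(2, 32, 28, 28)
    rgb_out, depth_out = mim(rgb, depth)
    print(rgb_out.shape, depth_out.shape)  # torch.Size([2, 32, 28, 28]) twice
```

In this sketch the cross-modal exchange is symmetric; a channel interaction module (CIM) as described in the abstract would additionally reweight or merge the two streams along the channel dimension before multiscale aggregation.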