Kim Dongyi, Cho Hyeon, Shin Hochul, Lim Soo-Chul, Hwang Wonjun
Department of Software and Computer Engineering, Ajou University, 206 Worldcup-ro, Yeongtong-gu, Suwon 16499, Korea.
Department of Mechanical, Robotics and Energy Engineering, Dongguk University, 30, Pildong-ro 1gil, Jung-gu, Seoul 04620, Korea.
Sensors (Basel). 2019 Aug 17;19(16):3579. doi: 10.3390/s19163579.
Interaction forces are traditionally measured by contact-type haptic sensors. In this paper, we propose a novel and practical method for inferring the interaction forces between two objects based solely on video data captured by a camera, a non-contact sensor, without the use of conventional haptic sensors. Specifically, we predict the interaction force by observing how the texture of the target object changes under an external force. Our hypothesis is that a three-dimensional (3D) convolutional neural network (CNN) can be trained to predict physical interaction forces from video images. To this end, we propose a bottleneck-based 3D depthwise separable CNN architecture in which the video is disentangled into spatial and temporal information. Applying depthwise convolution to each video frame allows spatial information to be learned efficiently, while 3D pointwise convolution learns linear combinations among sequential frames to capture temporal information. To train and validate the proposed model, we collected a large dataset of video clips of physical interactions between two objects under varying conditions (illumination and viewing angle), together with the corresponding interaction forces measured by a haptic sensor as the ground truth. Our experimental results confirm the hypothesis: compared with previous models, the proposed model is both more accurate and more efficient, and although its model size is roughly 10 times smaller than that of a standard 3D CNN architecture, it exhibits better accuracy. The experiments further demonstrate that the proposed model remains robust under different conditions and can successfully estimate the interaction force between objects.
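The abstract describes the core building block only at a high level. Below is a minimal sketch, assuming PyTorch, of one way such a bottleneck 3D depthwise separable block could look: a per-frame depthwise convolution for spatial texture and a pointwise convolution with temporal extent for mixing sequential frames. The kernel sizes (1×3×3 spatial, 3×1×1 temporal), the bottleneck ratio, and all layer names here are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of a bottleneck 3D depthwise separable block.
# Kernel sizes, bottleneck ratio, and normalization choices are assumptions.
import torch
import torch.nn as nn

class DepthwiseSeparable3DBlock(nn.Module):
    def __init__(self, in_ch, out_ch, bottleneck_ratio=4):
        super().__init__()
        mid_ch = out_ch // bottleneck_ratio  # bottleneck width (assumed ratio)
        # 1x1x1 pointwise conv: project into the bottleneck.
        self.reduce = nn.Conv3d(in_ch, mid_ch, kernel_size=1, bias=False)
        # Depthwise conv applied to each frame independently
        # (kernel 1x3x3, groups=mid_ch): learns spatial texture per frame.
        self.spatial = nn.Conv3d(
            mid_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1),
            groups=mid_ch, bias=False)
        # Pointwise conv with temporal extent (kernel 3x1x1): learns a
        # linear combination of neighboring frames, i.e., temporal mixing.
        self.temporal = nn.Conv3d(
            mid_ch, mid_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0),
            bias=False)
        # 1x1x1 pointwise conv: expand back out of the bottleneck.
        self.expand = nn.Conv3d(mid_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        y = self.act(self.reduce(x))
        y = self.act(self.spatial(y))
        y = self.act(self.temporal(y))
        return self.act(self.bn(self.expand(y)))

# Toy usage: a clip of 16 RGB frames at 112x112 resolution.
clip = torch.randn(1, 3, 16, 112, 112)
block = DepthwiseSeparable3DBlock(in_ch=3, out_ch=32)
print(block(clip).shape)  # torch.Size([1, 32, 16, 112, 112])
```

Because the spatial kernel touches only one frame and the temporal kernel touches only one spatial location per channel, this factorization uses far fewer parameters than a full 3D convolution over the same receptive field, which is consistent with the roughly 10-fold model-size reduction reported in the abstract.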