Fathy Ghada M, Hassan Hanan A, Sheta Walaa, Omara Fatma A, Nabil Emad
Informatics Research Institute, City for Scientific Research and Technological Applications, SRTA-City, Alexandria, Egypt.
Department of Computer Science, Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt.
PeerJ Comput Sci. 2021 May 12;7:e529. doi: 10.7717/peerj-cs.529. eCollection 2021.
Occlusion awareness is one of the most challenging problems in several fields, such as multimedia, remote sensing, computer vision, and computer graphics. Realistic interaction applications struggle with occlusion and collision problems in dynamic environments. Dense 3D reconstruction is the most effective way to address this issue; however, existing methods perform poorly in practical applications because accurate depth, camera pose, and object motion are unavailable. This paper proposes a new framework that builds a full 3D model reconstruction and overcomes the occlusion problem in complex dynamic scenes without using sensor data. Widely available devices such as a monocular camera are used to generate a model suitable for video-streaming applications. The main objective is to create a smooth and accurate 3D point cloud of a dynamic environment from the cumulative information of a sequence of RGB video frames. The framework is composed of two main phases. The first uses an unsupervised learning technique to predict scene depth, camera pose, and object motion from monocular RGB videos. The second performs frame-wise point-cloud fusion to reconstruct a 3D model from the video frame sequence. Several evaluation metrics are measured: localization error, RMSE, and fitness between the ground truth (KITTI's sparse LiDAR points) and the predicted point cloud. Moreover, the framework was compared with other methods using widely adopted state-of-the-art evaluation metrics such as MRE and Chamfer Distance. Experimental results show that the proposed framework surpasses the other methods and is a strong candidate for 3D model reconstruction.
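To make the second phase concrete, the following is a minimal sketch (not the authors' code) of frame-wise point-cloud fusion: each frame's predicted depth map is back-projected through the camera intrinsics K, transformed into world coordinates with the predicted camera-to-world pose, and accumulated into one cloud. The array shapes and the 4x4 pose convention are assumptions for illustration.

```python
import numpy as np

def backproject(depth, K):
    """Back-project an HxW depth map into an (H*W)x3 camera-space cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def fuse_frames(depths, poses, K):
    """Accumulate per-frame clouds into a single world-space point cloud.

    depths: list of HxW depth maps (as predicted by the depth network)
    poses:  list of 4x4 camera-to-world matrices (as predicted by the pose network)
    """
    clouds = []
    for depth, T in zip(depths, poses):
        pts = backproject(depth, K)                       # camera coordinates
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coordinates
        clouds.append((pts_h @ T.T)[:, :3])               # world coordinates
    return np.concatenate(clouds, axis=0)
```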
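The evaluation metrics named above can likewise be illustrated. The sketch below, assuming both clouds are Nx3 NumPy arrays in the same coordinate frame, computes a nearest-neighbour RMSE and a symmetric Chamfer Distance between a predicted cloud and sparse LiDAR ground truth; it is illustrative only, not the paper's exact evaluation code.

```python
import numpy as np
from scipy.spatial import cKDTree

def rmse_to_ground_truth(pred, gt):
    """RMSE of each predicted point to its nearest ground-truth LiDAR point."""
    d, _ = cKDTree(gt).query(pred)
    return float(np.sqrt(np.mean(d ** 2)))

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance: mean nearest-neighbour distance in both directions."""
    d_pg, _ = cKDTree(gt).query(pred)
    d_gp, _ = cKDTree(pred).query(gt)
    return float(d_pg.mean() + d_gp.mean())
```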