Liu Zhipeng, Hua Weihua, Liu Xiuguo, Liang Dong, Zhao Yabo, Shi Manxing
School of Geography and Information Engineering, China University of Geosciences, Wuhan 430074, China.
Sensors (Basel). 2021 Dec 5;21(23):8132. doi: 10.3390/s21238132.
Geospatial three-dimensional (3D) raster data are widely used for simple representation and analysis in applications such as geological models, spatio-temporal satellite data, hyperspectral images, and climate data. As resolution and accuracy requirements increase, the volume of geospatial 3D raster data has grown exponentially. In recent years, processing large raster data with Hadoop has gained popularity. However, data uploaded to Hadoop are distributed randomly across datanodes without regard to their spatial characteristics. As a result, directly processing geospatial 3D raster data causes massive network data exchange among the datanodes and degrades the performance of the cluster. To address this problem, we propose an efficient group-based replica placement policy for large-scale geospatial 3D raster data that optimizes the locations of the replicas in the cluster to reduce network overhead. An overlapped group scheme was designed for the three replicas of each file. The data in each group are placed on the same datanode, and different colocation patterns were implemented for the three replicas to further reduce communication between groups. The experimental results show that our approach significantly reduces network overhead during data acquisition for 3D raster data in a Hadoop cluster while still satisfying the Hadoop replica placement requirements.
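The idea of overlapped groups with per-replica colocation can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify the grouping or node-selection rules. Everything here is an assumption made for illustration: the function names `group_of` and `place_replicas`, the cubic groups of `group_size` blocks per axis, the per-replica boundary shift, and the split of datanodes into one disjoint pool per replica (used only to guarantee that the replicas of a block land on distinct nodes). Rack awareness is ignored.

```python
from itertools import product


def group_of(block, group_size, offset):
    """Return the index of the cubic group containing a block (i, j, k).

    `offset` shifts the group boundaries, so different replicas can use
    overlapping groupings of the same block grid (hypothetical scheme).
    """
    return tuple((c + offset) // group_size for c in block)


def place_replicas(grid_shape, group_size, num_nodes, num_replicas=3):
    """Assign every block of a 3D grid to one datanode per replica.

    Assumptions of this sketch (not taken from the paper):
    - replica r shifts group boundaries by r * group_size // num_replicas;
    - datanodes are split into num_replicas disjoint pools, and replica r
      only uses pool r, so replicas of a block never share a node.
    """
    pool_size = num_nodes // num_replicas
    if pool_size < 1:
        raise ValueError("need at least one datanode per replica")

    # One extra group per axis, because shifted boundaries can add a group.
    groups_per_axis = [(n + group_size - 1) // group_size + 1 for n in grid_shape]

    placement = {}
    for block in product(*(range(n) for n in grid_shape)):
        nodes = []
        for r in range(num_replicas):
            offset = r * group_size // num_replicas
            gi, gj, gk = group_of(block, group_size, offset)
            # Linearize the group index so each group maps to one node.
            linear = (gi * groups_per_axis[1] + gj) * groups_per_axis[2] + gk
            nodes.append(r + num_replicas * (linear % pool_size))
        placement[block] = nodes
    return placement


if __name__ == "__main__":
    placement = place_replicas((6, 6, 6), group_size=3, num_nodes=9)
    for block in [(0, 0, 0), (2, 2, 2), (3, 3, 3)]:
        print(block, placement[block])
```

In this sketch, blocks `(2, 2, 2)` and `(3, 3, 3)` fall into different groups for the first replica but share a group, and therefore a datanode, for the second and third. A task that reads a neighborhood crossing a group boundary of one replica can thus find the whole neighborhood on a single node by reading another replica, which is the kind of reduction in inter-node traffic the abstract describes.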