PARSUC: A Parallel Subsampling-Based Method for Clustering Remote Sensing Big Data.

Authors

Xia Huiyu, Huang Wei, Li Ning, Zhou Jianzhong, Zhang Dongying

Affiliations

Yangtze River Waterway Bureau, Nanjing 210011, China.

School of Hydropower and Information Engineering, Huazhong University of Science and Technology, Wuhan 430074, China.

Publication information

Sensors (Basel). 2019 Aug 5;19(15):3438. doi: 10.3390/s19153438.

DOI:10.3390/s19153438

PMID:31387335

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6696378/

Abstract

Remote sensing big data (RSBD) is generally characterized by huge volume, diversity, and high dimensionality. Mining hidden information from RSBD for different applications poses significant computational challenges. Clustering is an important data mining technique widely used in processing and analyzing remote sensing imagery. However, conventional clustering algorithms are designed for relatively small datasets and, when applied to RSBD, are generally too slow or inefficient for practical use. In this paper, we propose a parallel subsampling-based clustering (PARSUC) method that improves both the efficiency and the accuracy of RSBD clustering. PARSUC leverages a novel subsampling-based data partitioning (SubDP) method to realize three-step parallel clustering, which addresses a notable performance bottleneck of existing parallel clustering algorithms: they must perform numerous repeated calculations to reach a reasonable result. Furthermore, we propose a centroid filtering algorithm (CFA) to eliminate subsampling errors and to guarantee the accuracy of the clustering results. PARSUC was implemented on a Hadoop platform using the MapReduce parallel model. Experiments conducted on massive remote sensing images of different sizes showed that PARSUC (1) provided much better accuracy than conventional remote sensing clustering algorithms when handling larger image data; (2) achieved notable scalability as computing nodes were added; and (3) took much less time than the existing parallel clustering algorithm in handling RSBD.
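
The abstract names the building blocks (subsampling-based partitioning, three-step parallel clustering, centroid filtering) but does not specify them. The single-machine Python sketch below illustrates the general idea only, and it rests on several assumptions: the three steps are taken to be (1) clustering random subsamples in parallel, (2) filtering and merging the pooled local centroids, and (3) assigning all points to the merged centroids in parallel; plain k-means is assumed as the base clusterer; a thread pool stands in for MapReduce on Hadoop; and the median-distance rule in `filter_centroids` is a hypothetical stand-in for CFA, not the authors' algorithm. All function names are illustrative.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed base clusterer)."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids

def subsample_partitions(points, n_parts, sample_size, seed=0):
    """Assumed SubDP stand-in: each partition is a random subsample."""
    rng = random.Random(seed)
    return [rng.sample(points, sample_size) for _ in range(n_parts)]

def filter_centroids(pooled, k, factor=3.0):
    """Hypothetical CFA stand-in: group the pooled local centroids, drop
    members farther than factor * median distance from their group center,
    and average the rest."""
    centers = kmeans(pooled, k)
    groups = [[] for _ in range(k)]
    for c in pooled:
        groups[nearest(c, centers)].append(c)
    merged = []
    for center, group in zip(centers, groups):
        if not group:
            merged.append(center)
            continue
        d = sorted(dist2(c, center) for c in group)
        cutoff = factor ** 2 * d[len(d) // 2]  # squared-distance threshold
        kept = [c for c in group if dist2(c, center) <= cutoff] or group
        merged.append(mean(kept))
    return merged

def parsuc_sketch(points, k, n_parts=4, sample_size=50, workers=4):
    """Three assumed steps: local clustering, centroid filtering, labeling."""
    parts = subsample_partitions(points, n_parts, sample_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Step 1: cluster each subsample independently (the "map" side).
        local = list(pool.map(lambda part: kmeans(part, k), parts))
        # Step 2: pool, filter, and merge the local centroids.
        centroids = filter_centroids([c for cs in local for c in cs], k)
        # Step 3: label every point against the merged centroids, chunk by chunk.
        size = -(-len(points) // workers)  # ceiling division
        chunks = [points[i:i + size] for i in range(0, len(points), size)]
        labeled = pool.map(lambda ch: [nearest(p, centroids) for p in ch], chunks)
        labels = [lab for part in labeled for lab in part]
    return centroids, labels

if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(500)]
    data += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(500)]
    centroids, labels = parsuc_sketch(data, k=2)
    print(centroids)
```

In this sketch the expensive iterative clustering runs only on small subsamples, and the full dataset is touched once, in the final labeling pass. That is one plausible reading of how subsampling could avoid the repeated full-data computations the abstract mentions.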
