Tang Mingjie, Yu Yongyang, Mahmood Ahmed R, Malluhi Qutaibah M, Ouzzani Mourad, Aref Walid G
Chinese Academy of Science, Beijing, China.
Facebook, Menlo Park, CA, United States.
Front Big Data. 2020 Oct 16;3:30. doi: 10.3389/fdata.2020.00030. eCollection 2020.
Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory, distributed setting to address scalability. More specifically, we introduce new techniques for handling query skew, which commonly occurs in practice, and for minimizing communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to minimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew, and it uses new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries at that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. Our prototype system is termed LocationSpark. The experimental study is based on real datasets and demonstrates that LocationSpark can improve the performance of distributed spatial query processing by up to an order of magnitude over existing in-memory and distributed spatial systems.
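The idea of using a bitmap filter to forward queries only to relevant nodes can be illustrated with a minimal sketch. The abstract does not describe the actual filter structure used in LocationSpark, so everything below is an assumption made for illustration: the uniform grid, the class `GridBitmap`, the function `route`, and the `cells` parameter are all hypothetical names, not the paper's design or API. Each partition records which grid cells contain data, and a range query is forwarded to a partition only if some overlapping cell has its bit set.

```python
from dataclasses import dataclass


@dataclass
class GridBitmap:
    """Hypothetical per-partition bitmap over a uniform grid (illustrative only)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    cells: int = 10  # assumed grid resolution per axis
    bits: int = 0    # one bit per grid cell, packed into a Python int

    def _clamp(self, v):
        # Keep a cell index inside the grid.
        return max(0, min(self.cells - 1, v))

    def _cell(self, x, y):
        # Map a point to its (column, row) cell, clamping to the grid border.
        cx = int((x - self.min_x) * self.cells / (self.max_x - self.min_x))
        cy = int((y - self.min_y) * self.cells / (self.max_y - self.min_y))
        return self._clamp(cx), self._clamp(cy)

    def add(self, x, y):
        # Mark the cell containing the data point as occupied.
        cx, cy = self._cell(x, y)
        self.bits |= 1 << (cy * self.cells + cx)

    def may_contain(self, x1, y1, x2, y2):
        """True if any cell overlapping the rectangle is occupied.

        Expects x1 <= x2 and y1 <= y2. False positives are possible;
        false negatives are not.
        """
        cx1, cy1 = self._cell(x1, y1)
        cx2, cy2 = self._cell(x2, y2)
        for cy in range(cy1, cy2 + 1):
            for cx in range(cx1, cx2 + 1):
                if (self.bits >> (cy * self.cells + cx)) & 1:
                    return True
        return False


def route(query, filters):
    """Return ids of partitions whose filter cannot rule out the range query."""
    return [pid for pid, f in filters.items() if f.may_contain(*query)]


# Two partitions sharing the same global extent but holding different points.
filters = {0: GridBitmap(0, 0, 100, 100), 1: GridBitmap(0, 0, 100, 100)}
for x, y in [(10, 10), (20, 15)]:
    filters[0].add(x, y)
for x, y in [(80, 80), (90, 70)]:
    filters[1].add(x, y)

print(route((5, 5, 25, 25), filters))    # [0]
print(route((75, 65, 95, 85), filters))  # [1]
print(route((40, 40, 60, 60), filters))  # []
```

In this sketch the third query is never shipped to any partition, which is the kind of communication saving the abstract alludes to. A coarser grid shrinks the filter but admits more false positives.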