Aji Ablimit, Wang Fusheng, Saltz Joel H
Department of Mathematics & Computer Science, Emory University.
Department of Biomedical Informatics, Emory University.
Proc ACM SIGSPATIAL Int Conf Adv Inf. 2012 Nov 6;2012:309-318. doi: 10.1145/2424321.2424361.
Support for high-performance queries on large volumes of scientific spatial data is becoming increasingly important in many applications. This growth is driven not only by geospatial problems in numerous fields, but also by emerging scientific applications that are increasingly data- and compute-intensive. For example, digital pathology imaging has become an emerging field during the past decade, in which examination of high-resolution images of human tissue specimens enables more effective diagnosis, prediction, and treatment of diseases. Systematic analysis of large-scale pathology images generates tremendous amounts of spatially derived quantifications of micro-anatomic objects, such as nuclei, blood vessels, and tissue regions. Analytical pathology imaging has high potential to support image-based computer-aided diagnosis. One major requirement for this is effective querying of such enormous amounts of data with fast response, which faces two major challenges: the "big data" challenge and high computational complexity. In this paper, we present our work towards building a high-performance spatial query system for querying massive spatial data on MapReduce. Our framework takes an on-demand index building approach for processing spatial queries and a partition-merge approach for building parallel spatial query pipelines, which fits naturally with the computing model of MapReduce. We demonstrate our framework by supporting multi-way spatial joins for algorithm evaluation and nearest neighbor queries for micro-anatomic objects. To reduce query response time, we propose cost-based query optimization to mitigate the effect of data skew. Our experiments show that the framework can efficiently support complex analytical spatial queries on MapReduce.
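As an illustration only, the partition-merge idea with on-demand indexing can be sketched in plain Python for a two-way join of bounding boxes. This is not the authors' implementation: the tile size `TILE`, the dataset labels "A" and "B", every function name, and the sorted-list index (a stand-in for whatever spatial index the system builds) are assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import product

TILE = 10  # hypothetical tile edge length used for partitioning


def tiles_for(box, tile=TILE):
    """Return the ids of every tile that a box (xmin, ymin, xmax, ymax) overlaps."""
    xmin, ymin, xmax, ymax = box
    xs = range(int(xmin // tile), int(xmax // tile) + 1)
    ys = range(int(ymin // tile), int(ymax // tile) + 1)
    return product(xs, ys)


def intersects(a, b):
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def map_phase(name, boxes):
    """Partition step: emit (tile, record) for every tile an object overlaps."""
    for obj_id, box in boxes.items():
        for tile in tiles_for(box):
            yield tile, (name, obj_id, box)


def reduce_phase(records):
    """Join one tile: build a throwaway index on side B, then probe it with side A."""
    left = [(oid, box) for name, oid, box in records if name == "A"]
    # On-demand "index": side B sorted by xmin (assumed stand-in for a real spatial index).
    right = sorted(((oid, box) for name, oid, box in records if name == "B"),
                   key=lambda r: r[1][0])
    xmins = [box[0] for _, box in right]
    pairs = set()
    for a_id, a_box in left:
        # Only boxes whose left edge is at or before a's right edge can intersect a.
        hi = bisect_right(xmins, a_box[2])
        for b_id, b_box in right[:hi]:
            if intersects(a_box, b_box):
                pairs.add((a_id, b_id))
    return pairs


def spatial_join(a, b):
    """Partition both datasets by tile, join each tile, then merge the results."""
    groups = defaultdict(list)
    for name, boxes in (("A", a), ("B", b)):
        for tile, record in map_phase(name, boxes):
            groups[tile].append(record)
    result = set()
    for records in groups.values():
        # Merge step: the set union drops pairs found in more than one tile.
        result |= reduce_phase(records)
    return result


a = {"a1": (1, 1, 4, 4), "a2": (8, 8, 12, 12)}
b = {"b1": (3, 3, 5, 5), "b2": (11, 11, 13, 13), "b3": (20, 20, 21, 21)}
pairs = spatial_join(a, b)
```

Objects that straddle tile boundaries are replicated to every tile they touch, so the same pair can be reported by several tiles; merging into a set is one simple way to remove those duplicates. In a real MapReduce job, the grouping done here by `groups` would be handled by the shuffle between the map and reduce stages.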