Kumar Anand, Grupcev Vladimir, Yuan Yongke, Tu Yi-Cheng, Shen Gang
Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB 118, Tampa, FL 33620, USA.
School of Economics and Management, Beijing University of Technology, 100 Pingleyuan, Chaoyang District, Beijing 100124, China.
Adv Database Technol. 2012. doi: 10.1145/2247596.2247631.
Scientific applications generate large volumes of data, which poses challenges for storage and efficient query processing. Many queries against scientific data are analytical in nature and require super-linear computation time when straightforward methods are used. The spatial distance histogram (SDH) is one of the basic queries used to analyze molecular simulation (MS) data, and it takes quadratic time to compute with a brute-force approach. Often, an SDH query is executed continuously to analyze the simulation system over a period of time, which adds to the total time required to compute the SDH. In this paper, we propose an approximate algorithm to compute the SDH efficiently over consecutive time periods. In our approach, data is organized into a Quad-tree based data structure. The spatial locality of the particles (at a given time) in each node of the tree is acquired to determine the particle distribution. Similarly, the temporal locality of particles (between consecutive time periods) in each node is also acquired. The spatial distribution and temporal locality are then used to compute the approximate SDH at every time instant. Performance is further improved by storing and updating the spatial distribution information over time. The efficiency and accuracy of the proposed algorithm are supported by mathematical analysis and by the results of extensive experiments using biological data generated from real MS studies.
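To make the quadratic cost of the brute-force baseline concrete, the SDH query can be sketched as below. This is only an illustrative sketch of the straightforward method mentioned in the abstract, not the approximate algorithm proposed in the paper. The function name `brute_force_sdh`, its parameters `bucket_width` and `num_buckets`, and the choice to clamp large distances into the last bucket are all assumptions made for this example.

```python
import math
from itertools import combinations


def brute_force_sdh(points, bucket_width, num_buckets):
    """Count all pairwise distances into equal-width buckets.

    Hypothetical helper for illustration: it visits every pair of
    particles once, so the running time is quadratic in len(points).
    """
    hist = [0] * num_buckets
    for p, q in combinations(points, 2):
        d = math.dist(p, q)  # Euclidean distance between two particles
        # Assumption: distances beyond the last bucket are clamped into it.
        idx = min(int(d // bucket_width), num_buckets - 1)
        hist[idx] += 1
    return hist


# Three particles in 2D; the pairwise distances are 1, 7, and about 7.07.
particles = [(0.0, 0.0), (1.0, 0.0), (0.0, 7.0)]
print(brute_force_sdh(particles, bucket_width=5.0, num_buckets=2))  # [1, 2]
```

With N particles the loop examines N(N-1)/2 pairs, and the whole computation is repeated for every time instant when the query runs continuously. That repeated cost is what the approximate approach described in the abstract aims to reduce.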