Li Zhenlong, Yang Chaowei, Jin Baoxuan, Yu Manzhu, Liu Kai, Sun Min, Zhan Matthew
NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America.
NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America; Yunnan Provincial Geomatics Center, Yunnan Bureau of Surveying, Mapping, and GeoInformation, Kunming, Yunnan, China.
PLoS One. 2015 Mar 5;10(3):e0116781. doi: 10.1371/journal.pone.0116781. eCollection 2015.
Geoscience observations and model simulations are generating vast amounts of multi-dimensional data. Effectively analyzing these data is essential for geoscience studies. However, the task is challenging for geoscientists: processing such massive data is both computing- and data-intensive, and the analytics require complex procedures and multiple tools. To tackle these challenges, a scientific workflow framework is proposed for big geoscience data analytics. The framework leverages cloud computing, MapReduce, and Service Oriented Architecture (SOA). Specifically, HBase is adopted for storing and managing big geoscience data across distributed computers; a MapReduce-based algorithm framework is developed to support parallel processing of geoscience data; and a service-oriented workflow architecture is built to support on-demand complex data analytics in the cloud environment. A proof-of-concept prototype tests the performance of the framework. Results show that the framework significantly improves the efficiency of big geoscience data analytics by reducing data processing time and simplifying data analytical procedures for geoscientists.
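As a rough illustration of the MapReduce pattern the abstract mentions, the sketch below computes a mean per variable and time step from key-value records. It is not the authors' implementation: the `variable|time|level` row-key layout, the sample values, and the function names `map_record`, `reduce_partials`, and `run_job` are all assumptions made for this example, and plain Python stands in for Hadoop and HBase.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical records as (row_key, values) pairs, as might be scanned from a
# key-value store. The "variable|time|level" key layout is an assumption.
RECORDS = [
    ("temp|2015-01|0", [1.0, 2.0, 3.0]),
    ("temp|2015-01|1", [3.0, 4.0, 5.0]),
    ("temp|2015-02|0", [2.0, 2.0]),
]


def map_record(row_key, values):
    """Map step: emit ((variable, time), (partial_sum, count)) for one record."""
    variable, time, _level = row_key.split("|")
    return (variable, time), (sum(values), len(values))


def reduce_partials(a, b):
    """Reduce step: combine two (partial_sum, count) pairs."""
    return a[0] + b[0], a[1] + b[1]


def run_job(records):
    """Group mapped output by key (the shuffle), then reduce each group to a mean."""
    groups = defaultdict(list)
    for row_key, values in records:
        key, partial = map_record(row_key, values)
        groups[key].append(partial)

    means = {}
    for key, partials in groups.items():
        total, count = reduce(reduce_partials, partials)
        means[key] = total / count
    return means


if __name__ == "__main__":
    for key, mean in sorted(run_job(RECORDS).items()):
        print(key, mean)
```

Emitting partial sums and counts rather than per-record means keeps the reduce step associative, so partial results from different workers could be combined in any order.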