Kumar Vijay S, Sadayappan P, Mehta Gaurang, Vahi Karan, Deelman Ewa, Ratnakar Varun, Kim Jihie, Gil Yolanda, Hall Mary, Kurc Tahsin, Saltz Joel
Proc Int Symp High Perform Distrib Comput. 2009:177-186. doi: 10.1145/1551609.1551638.
Data analysis processes in scientific applications can be expressed as coarse-grain workflows of complex data processing operations with data flow dependencies between them. Performance optimization of these workflows can be viewed as a search for a set of optimal values in a multi-dimensional parameter space. While some performance parameters such as grouping of workflow components and their mapping to machines do not affect the accuracy of the output, others may dictate trading the output quality of individual components (and of the whole workflow) for performance. This paper describes an integrated framework which is capable of supporting performance optimizations along multiple dimensions of the parameter space. Using two real-world applications in the spatial data analysis domain, we present an experimental evaluation of the proposed framework.
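To make the idea of optimization as a search over a multi-dimensional parameter space concrete, the following is a minimal sketch only, not the framework described in the paper. The parameter names (`group_size`, `num_machines`, `resolution`), their values, and both cost models are hypothetical and chosen purely for illustration. Two parameters are accuracy-neutral and one trades output quality for speed.

```python
from itertools import product

# Hypothetical parameter space (illustrative values, not taken from the paper).
PARAM_SPACE = {
    "group_size": [1, 2, 4],         # components grouped per task; accuracy-neutral
    "num_machines": [2, 4, 8],       # mapping to machines; accuracy-neutral
    "resolution": [0.25, 0.5, 1.0],  # fraction of input detail kept; affects quality
}


def estimate_time(group_size, num_machines, resolution):
    """Toy cost model (an assumption): parallel compute plus overheads."""
    compute = 100.0 * resolution / num_machines
    scheduling_overhead = 8.0 / group_size
    communication = 0.5 * num_machines
    return compute + scheduling_overhead + communication


def estimate_quality(resolution):
    """Toy quality model (an assumption): quality equals the detail kept."""
    return resolution


def search(min_quality):
    """Exhaustively scan the grid for the fastest configuration whose
    estimated quality is at least min_quality. Returns (config, time),
    or None when no configuration satisfies the quality bound."""
    best = None
    names = list(PARAM_SPACE)
    for values in product(*(PARAM_SPACE[name] for name in names)):
        config = dict(zip(names, values))
        if estimate_quality(config["resolution"]) < min_quality:
            continue
        elapsed = estimate_time(**config)
        if best is None or elapsed < best[1]:
            best = (config, elapsed)
    return best
```

Calling `search(1.0)` restricts the search to full-quality configurations, while a lower bound such as `search(0.5)` lets the search also give up output quality for speed. Exhaustive enumeration is used here only because the toy grid is small; a real parameter space would likely need a more selective search strategy.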