Matteo Interlandi, Sai Deep Tetali, Muhammad Ali Gulzar, Joseph Noor, Tyson Condie, Miryung Kim, Todd Millstein
University of California, Los Angeles.
Proc ACM Symp Cloud Comput. 2016 Oct;2016:510-522. doi: 10.1145/2987550.2987565.
Modern Data-Intensive Scalable Computing (DISC) systems process data through batch jobs that execute programs (e.g., queries) compiled from a high-level language. Developers often build these programs interactively, posing ad-hoc queries over the base data until they obtain the desired result. We observe that the successive queries leading to the final program can overlap significantly in structure. Yet each slightly modified query is executed from scratch, which can significantly lengthen the development cycle. Vega is a framework for Apache Spark that we have implemented to optimize a series of similar Spark programs, such as those that arise during a development or exploratory data analysis session. Spark developers (e.g., data scientists) can use Vega to significantly reduce the time it takes to re-execute a modified Spark program, shortening the overall time to market of their Big Data applications.
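The overlap the abstract describes can be illustrated with a small sketch. It is not Vega's mechanism, which the abstract does not detail, and it uses plain Python rather than Spark. It assumes a pipeline is a list of named stages, that a stage name uniquely identifies its function, and that every run uses the same base data. The class `PrefixCache` and the stage names are hypothetical, introduced only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A stage is a (name, function) pair; the name is assumed to identify the function.
Stage = Tuple[str, Callable[[list], list]]


class PrefixCache:
    """Caches the output of every pipeline prefix, keyed by the stage names in it.

    Illustrative assumption: all runs share the same base data, so the
    stage names alone are a sufficient cache key.
    """

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, ...], list] = {}
        self.executed: List[str] = []  # log of the stages that actually ran

    def run(self, base: list, stages: List[Stage]) -> list:
        names = tuple(name for name, _ in stages)
        # Find the longest prefix of this pipeline whose result is cached.
        start, data = 0, base
        for k in range(len(stages), 0, -1):
            if names[:k] in self._results:
                start, data = k, self._results[names[:k]]
                break
        # Execute only the remaining stages, caching each new prefix.
        for i in range(start, len(stages)):
            name, fn = stages[i]
            data = fn(data)
            self.executed.append(name)
            self._results[names[: i + 1]] = data
        return data


# Two "queries" that differ only in their last stage.
parse = ("parse", lambda rows: [int(r) for r in rows])
evens = ("evens", lambda xs: [x for x in xs if x % 2 == 0])
total = ("sum", lambda xs: [sum(xs)])
maximum = ("max", lambda xs: [max(xs)])

cache = PrefixCache()
base = ["1", "2", "3", "4"]
first = cache.run(base, [parse, evens, total])     # runs all three stages
second = cache.run(base, [parse, evens, maximum])  # reuses parse + evens
```

In this sketch the second run executes only the `max` stage, because the shared `parse` and `evens` prefix is served from the cache. A real system would also have to detect when the base data or a stage's logic has changed, which this sketch deliberately ignores.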