Optimizing Interactive Development of Data-Intensive Applications.

Authors

Interlandi Matteo, Tetali Sai Deep, Gulzar Muhammad Ali, Noor Joseph, Condie Tyson, Kim Miryung, Millstein Todd

Affiliation

University of California, Los Angeles.

Publication

Proc ACM Symp Cloud Comput. 2016 Oct;2016:510-522. doi: 10.1145/2987550.2987565.

DOI:10.1145/2987550.2987565

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5386325/

Abstract

Modern Data-Intensive Scalable Computing (DISC) systems are designed to process data through batch jobs that execute programs (e.g., queries) compiled from a high-level language. These programs are often developed interactively by posing ad-hoc queries over the base data until a desired result is generated. We observe that there can be significant overlap in the structure of these queries used to derive the final program. Yet, each successive execution of a slightly modified query is performed anew, which can significantly increase the development cycle. Vega is an Apache Spark framework that we have implemented for optimizing a series of similar Spark programs, likely originating from a development or exploratory data analysis session. Spark developers (e.g., data scientists) can leverage Vega to significantly reduce the amount of time it takes to re-execute a modified Spark program, reducing the overall time to market for their Big Data applications.
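The overlap the abstract describes can be illustrated with a small sketch. This is not Vega's algorithm and does not use Spark; it is a plain-Python toy showing only why two queries that share a leading sequence of operators need not recompute that shared part. The class `PrefixCache`, the stage labels, and the sample queries are hypothetical names introduced for illustration, and the sketch assumes that every call uses the same base data and that a label uniquely identifies its transformation.

```python
from typing import Callable, Dict, List, Tuple

# A stage is a (label, transformation) pair. Labels are assumed to uniquely
# identify the transformation; this is an illustrative simplification.
Stage = Tuple[str, Callable[[list], list]]


class PrefixCache:
    """Toy memoizer: stores the output of every pipeline prefix it has run."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, ...], list] = {}
        self.stages_executed = 0  # counts transformations actually applied

    def run(self, base: list, stages: List[Stage]) -> list:
        labels = tuple(label for label, _ in stages)

        # Find the longest prefix of this pipeline whose result is cached.
        start, data = 0, base
        for n in range(len(labels), 0, -1):
            if labels[:n] in self._results:
                start, data = n, self._results[labels[:n]]
                break

        # Execute only the stages after the cached prefix.
        for i in range(start, len(stages)):
            _, fn = stages[i]
            data = fn(data)
            self.stages_executed += 1
            self._results[labels[: i + 1]] = data
        return data


base = list(range(10))
cache = PrefixCache()

# First query: square every value, then keep the even results.
query1: List[Stage] = [
    ("square", lambda xs: [x * x for x in xs]),
    ("keep_even", lambda xs: [x for x in xs if x % 2 == 0]),
]
# Modified query: same first stage, different final filter.
query2: List[Stage] = [
    query1[0],
    ("keep_odd", lambda xs: [x for x in xs if x % 2 == 1]),
]

result1 = cache.run(base, query1)  # runs both stages
result2 = cache.run(base, query2)  # reuses the cached "square" output
```

After both calls, `cache.stages_executed` is 3 rather than 4, because the second query skips the shared `square` stage. A real system has to decide when two operators are actually equivalent and when cached data is still valid, which this sketch sidesteps by trusting labels.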

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/46c0/5386325/f57af58d2bb2/nihms849619f1.jpg
