Shkapsky Alexander, Yang Mohan, Interlandi Matteo, Chiu Hsuan, Condie Tyson, Zaniolo Carlo
University of California, Los Angeles.
Proc ACM SIGMOD Int Conf Manag Data. 2016 Jun-Jul;2016:1135-1149. doi: 10.1145/2882903.2915229.
There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires a deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses this problem by providing concise declarative specifications of complex queries that are amenable to efficient evaluation. Toward this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and the effectiveness of Spark in supporting Datalog-based analytics.
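To make "recursion" in a Datalog query concrete, consider the classic transitive-closure program, written here as an illustration that is not taken from the abstract: tc(X,Y) <- arc(X,Y). tc(X,Y) <- tc(X,Z), arc(Z,Y). The sketch below evaluates it on a single machine with a textbook fixpoint loop that joins only the facts derived in the previous round. The function name `transitive_closure` and the set-of-pairs representation are assumptions chosen for this example; it does not use Spark and is not claimed to reflect how BigDatalog compiles or executes such queries.

```python
def transitive_closure(arcs):
    """Evaluate the two tc rules over a set of (source, target) pairs.

    Illustrative single-machine sketch only; not BigDatalog's implementation.
    """
    # Index arc facts by source node so the join tc(X,Z), arc(Z,Y) is a lookup.
    successors = {}
    for src, dst in arcs:
        successors.setdefault(src, set()).add(dst)

    tc = set(arcs)     # base rule: tc(X,Y) <- arc(X,Y)
    delta = set(arcs)  # facts that were new in the previous round

    # Recursive rule: repeat until a round derives no new facts (fixpoint).
    while delta:
        derived = {
            (x, y)
            for (x, z) in delta
            for y in successors.get(z, ())
        }
        delta = derived - tc  # keep only facts not seen before
        tc |= delta
    return tc


if __name__ == "__main__":
    print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
```

The loop terminates on cyclic graphs as well, because each round keeps only facts that are not already in `tc`, and the number of possible pairs is finite.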