Zeighami Sepanta, Shahabi Cyrus, Sharan Vatsal
University of Southern California, USA.
Proc ACM Manag Data. 2023 May;1(1). doi: 10.1145/3588954. Epub 2023 May 30.
Range aggregate queries (RAQs) are an integral part of many real-world applications, where fast, approximate answers are often desired. Recent work has studied answering RAQs with machine learning (ML) models, where a model of the data is learned to answer the queries. However, there is no theoretical understanding of why and when the ML-based approaches perform well. Furthermore, because these approaches model the data, they fail to capitalize on any query-specific information to improve performance in practice. In this paper, we focus on modeling "queries" rather than data and train neural networks to learn the query answers. This change of focus allows us to study our ML approach theoretically and to provide a distribution- and query-dependent error bound for neural networks when answering RAQs. We confirm our theoretical results by developing NeuroSketch, a neural network framework to answer RAQs in practice. An extensive experimental study on real-world, TPC-benchmark, and synthetic datasets shows that NeuroSketch answers RAQs multiple orders of magnitude faster than the state of the art, and with better accuracy.
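The idea of modeling queries rather than data can be illustrated with a small sketch. It is not the NeuroSketch architecture or training procedure, which the abstract does not specify. It assumes a one-dimensional dataset on [0, 1], a normalized COUNT aggregate, a query encoded as (lower bound, width), and a one-hidden-layer network trained by plain gradient descent. All function names (`range_fraction`, `make_training_set`, `init_params`, `predict`, `train`) are hypothetical.

```python
import numpy as np

def range_fraction(data, lo, hi):
    """Exact answer: fraction of points falling in [lo, hi)."""
    return float(np.mean((data >= lo) & (data < hi)))

def make_training_set(data, n_queries, rng):
    """Sample random range queries and label them with exact answers."""
    lo = rng.uniform(0.0, 1.0, n_queries)
    width = rng.uniform(0.0, 1.0 - lo)          # keeps lo + width <= 1
    X = np.stack([lo, width], axis=1)           # query features (assumed encoding)
    y = np.array([range_fraction(data, a, a + w) for a, w in X])
    return X, y[:, None]

def init_params(rng, hidden=32):
    """Random weights for a 2 -> hidden -> 1 network."""
    return {
        "W1": rng.normal(0.0, 1.0, (2, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, 1)),
        "b2": np.zeros(1),
    }

def predict(params, X):
    """Map query features directly to an estimated answer."""
    h = np.tanh(X @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def train(params, X, y, steps=2000, lr=0.02):
    """Full-batch gradient descent on half the mean squared error."""
    losses = []
    n = len(X)
    for _ in range(steps):
        h = np.tanh(X @ params["W1"] + params["b1"])
        err = h @ params["W2"] + params["b2"] - y
        losses.append(0.5 * float(np.mean(err ** 2)))
        g_out = err / n                                   # dL/d(prediction)
        g_h = (g_out @ params["W2"].T) * (1.0 - h ** 2)   # backprop through tanh
        params["W2"] -= lr * (h.T @ g_out)
        params["b2"] -= lr * g_out.sum(axis=0)
        params["W1"] -= lr * (X.T @ g_h)
        params["b1"] -= lr * g_h.sum(axis=0)
    return losses

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.clip(rng.normal(0.5, 0.15, 5000), 0.0, 1.0)  # synthetic data
    X, y = make_training_set(data, 1000, rng)
    params = init_params(rng)
    train(params, X, y)
    query = np.array([[0.4, 0.2]])                         # range [0.4, 0.6)
    print("model:", float(predict(params, query)[0, 0]))
    print("exact:", range_fraction(data, 0.4, 0.6))
```

Once trained, answering a query costs one forward pass and never touches the data, which is the intuition behind the speed claim. The accuracy of a toy model like this one says nothing about the results reported in the paper.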