Zeighami Sepanta, Shahabi Cyrus
University of California, Berkeley. Work done while at USC's Infolab.
University of Southern California.
Proc Mach Learn Res. 2024 Jul;235:58283-58305.
Use of machine learning to perform database operations, such as indexing, cardinality estimation, and sorting, is shown to provide substantial performance benefits. However, when datasets change and data distribution shifts, empirical results also show performance degradation for learned models, possibly to worse than non-learned alternatives. This, together with a lack of theoretical understanding of learned methods undermines their practical applicability, since there are no guarantees on how well the models will perform after deployment. In this paper, we present the first known theoretical characterization of the performance of learned models in dynamic datasets, for the aforementioned operations. Our results show novel theoretical characteristics achievable by learned models and provide bounds on the performance of the models that characterize their advantages over non-learned methods, showing why and when learned models can outperform the alternatives. Our analysis develops the framework and novel theoretical tools which build the foundation for the analysis of learned database operations in the future.
机器学习用于执行数据库操作(如索引、基数估计和排序)已被证明能带来显著的性能提升。然而,当数据集发生变化且数据分布发生偏移时,实证结果也表明,已学习模型的性能会下降,甚至可能比未学习的替代方法更差。这一点,再加上对已学习方法缺乏理论理解,削弱了它们的实际适用性,因为无法保证模型在部署后的性能表现。在本文中,我们针对上述操作,给出了动态数据集中已学习模型性能的首个已知理论特征。我们的结果展示了已学习模型可实现的新颖理论特征,并给出了模型性能的界限,这些界限刻画了它们相对于未学习方法的优势,说明了已学习模型能够超越替代方法的原因和时机。我们的分析构建了框架和新颖的理论工具,为未来已学习数据库操作的分析奠定了基础。