Cormode Graham, Dickens Charlie, Woodruff David P
University of Warwick.
Carnegie Mellon University.
Proc ACM SIGACT SIGMOD SIGART Symp Princ Database Syst. 2021 Jun;2021:273-284. doi: 10.1145/3452021.3458312. Epub 2021 Jun 20.
Given an × dimensional dataset , a projection query specifies a subset ⊆ [] of columns which yields a new × || array. We study the space complexity of computing data analysis functions over such subspaces, including heavy hitters and norms, when the subspaces are revealed only after observing the data. We show that this important class of problems is typically hard: for many problems, we show 2 lower bounds. However, we present upper bounds which demonstrate space dependency better than 2 . That is, for ' ∈ (0, 1) and a parameter = 2 an -approximation can be obtained in space , showing that it is possible to improve on the naïve approach of keeping information for all 2 subsets of columns. Our results are based on careful constructions of instances using coding theory and novel combinatorial reductions that exhibit such space-approximation tradeoffs.
给定一个 (n\times d) 维数据集 (D),投影查询指定列的一个子集 (S\subseteq [d]),它会产生一个新的 (n\times |S|) 数组。当仅在观察数据之后才揭示子空间时,我们研究在此类子空间上计算数据分析函数(包括频繁项和范数)的空间复杂度。我们表明这类重要的问题通常很难:对于许多问题,我们给出了 (2^d) 的下界。然而,我们给出的上界表明空间依赖性优于 (2^d)。也就是说,对于 (\epsilon\in(0,1)) 和参数 (k = 2^{d\epsilon}),可以在空间 (O(nk)) 中获得 (\epsilon)-近似,这表明有可能改进为所有 (d) 列的 (2^d) 个子集保留信息的朴素方法。我们的结果基于使用编码理论精心构造实例以及展示此类空间近似权衡的新颖组合约简。