Ingber Amir, Courtade Thomas, Weissman Tsachy
Department of Electrical Engineering, Stanford University, Stanford, CA 94305 USA.
IEEE Trans Inf Theory. 2015 May;61(5):2729-2747. doi: 10.1109/tit.2015.2402972. Epub 2015 Feb 11.
The problem of performing similarity queries on compressed data is considered. We focus on the quadratic similarity measure, and study the fundamental tradeoff between compression rate, sequence length, and reliability of queries performed on the compressed data. For a Gaussian source, we show that the queries can be answered reliably if and only if the compression rate exceeds a given threshold-the - which we explicitly characterize. Moreover, when compression is performed at a rate greater than the identification rate, responses to queries on the compressed data can be made exponentially reliable. We give a complete characterization of this exponent, which is analogous to the error and excess-distortion exponents in channel and source coding, respectively. For a general source, we prove that, as with classical compression, the Gaussian source requires the largest compression rate among sources with a given variance. Moreover, a robust scheme is described that attains this maximal rate for any source distribution.
考虑了对压缩数据执行相似性查询的问题。我们专注于二次相似性度量,并研究压缩率、序列长度以及对压缩数据执行查询的可靠性之间的基本权衡。对于高斯源,我们表明,当且仅当压缩率超过给定阈值时,查询才能可靠地得到回答,我们明确刻画了该阈值。此外,当以大于识别率的速率进行压缩时,对压缩数据的查询响应可以具有指数可靠性。我们对该指数进行了完整的刻画,它分别类似于信道编码和源编码中的错误指数和超额失真指数。对于一般源,我们证明,与经典压缩一样,在具有给定方差的源中,高斯源需要最大的压缩率。此外,还描述了一种鲁棒方案,该方案对于任何源分布都能达到这个最大速率。