Ma Yi, Derksen Harm, Hong Wei
Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
IEEE Trans Pattern Anal Mach Intell. 2007 Sep;29(9):1546-62. doi: 10.1109/TPAMI.2007.1085.
In this paper, based on ideas from lossy data coding and compression, we present a simple but effective technique for segmenting multivariate mixed data that are drawn from a mixture of Gaussian distributions, which are allowed to be almost degenerate. The goal is to find the optimal segmentation that minimizes the overall coding length of the segmented data, subject to a given distortion. By analyzing the coding length/rate of mixed data, we formally establish some strong connections of data segmentation to many fundamental concepts in lossy data compression and rate distortion theory. We show that a deterministic segmentation is approximately the (asymptotically) optimal solution for compressing mixed data. We propose a very simple and effective algorithm which depends on a single parameter, the allowable distortion. At any given distortion, the algorithm automatically determines the corresponding number and dimension of the groups and does not involve any parameter estimation. Simulation results reveal intriguing phase-transition-like behaviors of the number of segments when changing the level of distortion or the amount of outliers. Finally, we demonstrate how this technique can be readily applied to segment real imagery and bioinformatic data.
在本文中,基于有损数据编码和压缩的思想,我们提出了一种简单而有效的技术,用于对源自高斯分布混合的多元混合数据进行分割,这些高斯分布允许几乎退化。目标是找到在给定失真条件下使分割后数据的总编码长度最小化的最优分割。通过分析混合数据的编码长度/速率,我们正式建立了数据分割与有损数据压缩和率失真理论中许多基本概念的紧密联系。我们表明,确定性分割近似于(渐近地)压缩混合数据的最优解。我们提出了一种非常简单有效的算法,该算法依赖于单个参数,即允许的失真。在任何给定的失真下,该算法会自动确定相应的组的数量和维度,并且不涉及任何参数估计。模拟结果揭示了在改变失真水平或异常值数量时,段数呈现出有趣的类似相变的行为。最后,我们展示了该技术如何能够轻松应用于分割真实图像和生物信息数据。