Su Ya, Reedy Jill, Carroll Raymond J
Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143.
Epidemiology and Genomics Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD 20892.
Stat Sin. 2018 Oct;28(4):2337-2351.
This paper is dedicated to the memory of Peter G. Hall. It concerns a deceptively simple question: if one observes variables corrupted with measurement error of possibly very complex form, can one recreate asymptotically the clusters that would have been found had there been no measurement error? We show that the answer is yes, and that the solution is surprisingly simple and general. The method itself is to simulate, by computer, realizations with the same distribution as that of the true variables, and then to apply clustering to these realizations. Technically, we show that if one uses K-means clustering or any other risk-minimizing clustering, and a multivariate deconvolution device with certain smoothness and convergence properties, then, in the limit, the cluster means based on our method converge to the cluster means that would be obtained if there were no measurement error. Along with the method and its technical justification, we analyze two important nutrition data sets, finding patterns that make sense nutritionally.
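The simulate-then-cluster idea can be illustrated with a minimal sketch. It is not the paper's multivariate deconvolution device. It assumes a one-dimensional toy setting in which the true variable X is normal, the observed variable is W = X + U with normal error U of known standard deviation `sigma_u`, and the distribution of X is recovered by subtracting the error variance from the variance of W. The function `kmeans_1d` and all variable names are hypothetical and used only for illustration.

```python
import math
import random
import statistics


def kmeans_1d(xs, k=2, iters=30):
    """Plain Lloyd's algorithm in one dimension; returns sorted cluster means."""
    xs_sorted = sorted(xs)
    n = len(xs_sorted)
    # Deterministic start: centers at evenly spaced sample quantiles.
    centers = [xs_sorted[int((j + 0.5) * n / k)] for j in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        for x in xs:
            # Assign each point to its nearest center.
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            sums[j] += x
            counts[j] += 1
        # Update each center to the mean of its assigned points.
        centers = [sums[j] / counts[j] if counts[j] else centers[j]
                   for j in range(k)]
    return sorted(centers)


rng = random.Random(0)
n = 10000
sigma_u = 1.0  # assumption: measurement error s.d. is known

# Toy data: X is never observed in practice; it is kept here only for comparison.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
w = [xi + rng.gauss(0.0, sigma_u) for xi in x]

# Stand-in "deconvolution" under the normality assumption:
# Var(X) = Var(W) - Var(U), and E(X) = E(W).
mu_hat = statistics.fmean(w)
var_x_hat = max(statistics.variance(w) - sigma_u ** 2, 1e-12)

# Simulate realizations from the estimated distribution of X, then cluster them.
x_sim = [rng.gauss(mu_hat, math.sqrt(var_x_hat)) for _ in range(n)]

centers_true = kmeans_1d(x)        # clustering the error-free variable
centers_naive = kmeans_1d(w)       # clustering the error-prone variable
centers_deconv = kmeans_1d(x_sim)  # clustering the simulated realizations

print("error-free:", centers_true)
print("naive:     ", centers_naive)
print("simulated: ", centers_deconv)
```

In this toy setting, clustering W directly spreads the two cluster means further apart than clustering X would, because the measurement error inflates the variance. Clustering the simulated realizations should give means close to the error-free ones. A realistic application would replace the variance-subtraction step with a deconvolution estimator suited to the actual error structure.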