McLachlan GJ, Rathnayake SI
Department of Mathematics, University of Queensland, St. Lucia, Queensland, Australia.
J Biopharm Stat. 2011 Nov;21(6):1113-25. doi: 10.1080/10543406.2011.608342.
When finite mixture models are used to cluster a data set, the crucial question of how many clusters the data contain can be addressed by testing for the smallest number of components in the mixture model that is compatible with the data. We investigate the performance of a resampling approach to this testing problem for high-dimensional data, where the number of variables p is extremely large relative to the number of observations n. To fit normal mixture models to such data, some form of dimension reduction has to be performed. This raises the question of whether a practically significant bias results if the bootstrapping is carried out solely on the reduced-dimensional form of the data, rather than by drawing the bootstrap replications from the full data.
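To make the testing idea concrete, the following is a minimal sketch of one common resampling scheme: a parametric bootstrap of the likelihood ratio statistic for g0 versus g0 + 1 components. The abstract does not spell out the procedure, so the choice of statistic, the univariate data, the common-variance restriction, the quantile-based starting values, and all function names (`fit_mixture`, `bootstrap_lrt`, and so on) are illustrative assumptions. The sketch works on data that are already low-dimensional and does not address the full-data versus reduced-data comparison studied in the paper.

```python
import math
import random


def densities(v, weights, means, var):
    """Weighted normal densities of one observation under each component."""
    norm = math.sqrt(2.0 * math.pi * var)
    return [w * math.exp(-0.5 * (v - m) ** 2 / var) / norm
            for w, m in zip(weights, means)]


def fit_mixture(x, g, n_iter=100):
    """EM fit of a univariate g-component normal mixture with a common
    variance (an assumption made here to keep the likelihood bounded).
    Returns (log_likelihood, weights, means, variance)."""
    n = len(x)
    xs = sorted(x)
    grand_mean = sum(x) / n
    var = sum((v - grand_mean) ** 2 for v in x) / n
    # Starting values: means at evenly spaced sample quantiles, equal weights.
    means = [xs[int((k + 0.5) * n / g)] for k in range(g)]
    weights = [1.0 / g] * g
    for _ in range(n_iter):
        # E-step: posterior probabilities of component membership.
        resp = []
        for v in x:
            d = densities(v, weights, means, var)
            total = sum(d) + 1e-300  # guard against underflow to zero
            resp.append([dk / total for dk in d])
        # M-step: update weights, means, and the common variance.
        nk = [sum(r[k] for r in resp) for k in range(g)]
        weights = [c / n for c in nk]
        means = [sum(r[k] * v for r, v in zip(resp, x)) / max(nk[k], 1e-12)
                 for k in range(g)]
        var = sum(r[k] * (v - means[k]) ** 2
                  for r, v in zip(resp, x) for k in range(g)) / n
        var = max(var, 1e-8)
    loglik = sum(math.log(sum(densities(v, weights, means, var)) + 1e-300)
                 for v in x)
    return loglik, weights, means, var


def lrt_statistic(x, g0):
    """-2 log(lambda) for H0: g0 components versus H1: g0 + 1 components."""
    ll0 = fit_mixture(x, g0)[0]
    ll1 = fit_mixture(x, g0 + 1)[0]
    # Truncate at zero in case EM stops at a poor local maximum under H1.
    return max(0.0, 2.0 * (ll1 - ll0))


def sample_mixture(n, weights, means, var, rng):
    """Draw n observations from a fitted common-variance normal mixture."""
    sd = math.sqrt(var)
    comps = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(means[k], sd) for k in comps]


def bootstrap_lrt(x, g0, n_boot=19, seed=0):
    """Parametric bootstrap p-value for the LRT of g0 versus g0 + 1."""
    rng = random.Random(seed)
    observed = lrt_statistic(x, g0)
    _, w, m, var = fit_mixture(x, g0)
    # Simulate from the fitted null model and recompute the statistic.
    null_stats = [lrt_statistic(sample_mixture(len(x), w, m, var, rng), g0)
                  for _ in range(n_boot)]
    p = (1 + sum(s >= observed for s in null_stats)) / (n_boot + 1)
    return observed, p


# Synthetic data with two well-separated groups (made up for illustration).
data_rng = random.Random(1)
data = ([data_rng.gauss(0.0, 1.0) for _ in range(30)]
        + [data_rng.gauss(6.0, 1.0) for _ in range(30)])
stat, p_value = bootstrap_lrt(data, g0=1)
print(f"-2 log lambda = {stat:.2f}, bootstrap p-value = {p_value:.3f}")
```

In a sequential use of such a test, g0 would be increased from 1 until the null hypothesis is no longer rejected, and that value taken as the smallest number of components compatible with the data. With only 19 bootstrap replications the smallest attainable p-value is 0.05, so a real analysis would use many more.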