Vera José Fernando, de Rooij Mark, Heiser Willem J
University of Granada, Spain.
Br J Math Stat Psychol. 2014 Nov;67(3):514-40. doi: 10.1111/bmsp.12038. Epub 2014 Mar 24.
In this paper we propose a latent class distance association model for clustering in the predictor space of large contingency tables with a categorical response variable. The rows of such a table are characterized as profiles of a set of explanatory variables, while the columns represent a single outcome variable. In many cases such tables are sparse, with many zero entries, which makes traditional models problematic. By clustering the row profiles into a few specific classes and representing these together with the categories of the response variable in a low-dimensional Euclidean space using a distance association model, a parsimonious prediction model can be obtained. A generalized EM algorithm is proposed to estimate the model parameters, and the adjusted Bayesian information criterion statistic is used to select the number of mixture components and the dimensionality of the representation. An empirical example highlighting the advantages of the new approach and comparing it with traditional approaches is presented.
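The abstract does not state the model equations, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the common distance-association form in which the probability of response category j for latent class k is proportional to exp(-d²), where d is the Euclidean distance between the class point and the category point, optionally shifted by a per-category main effect. It also assumes the sample-size adjusted BIC, which replaces log(n) with log((n + 2) / 24). The names `response_probabilities`, `adjusted_bic`, `class_points`, `response_points`, and `response_bias` are hypothetical.

```python
import numpy as np


def response_probabilities(class_points, response_points, response_bias=None):
    """Class-conditional response probabilities from squared distances.

    Assumed form: P(j | k) is proportional to exp(bias_j - d^2(x_k, y_j)).
    class_points: (K, M) coordinates of the latent classes.
    response_points: (J, M) coordinates of the response categories.
    """
    diff = class_points[:, None, :] - response_points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)          # shape (K, J)
    logits = -sq_dist
    if response_bias is not None:
        logits = logits + response_bias[None, :]
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def adjusted_bic(log_likelihood, n_params, n_obs):
    """Sample-size adjusted BIC (assumed form): -2 logL + p * log((n + 2) / 24)."""
    return -2.0 * log_likelihood + n_params * np.log((n_obs + 2.0) / 24.0)


# Two hypothetical classes and three response categories in two dimensions.
class_points = np.array([[0.0, 0.0], [2.0, 0.0]])
response_points = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
probs = response_probabilities(class_points, response_points)
```

In this sketch each class assigns the highest probability to the response category closest to it in the map. Candidate models with different numbers of classes or dimensions could then be compared by their `adjusted_bic` values, with lower values preferred.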