Bhattacharya Anirban, Dunson David B
Department of Statistical Science, Duke University, NC 27708.
J Am Stat Assoc. 2012 Mar 1;107(497):362-377. doi: 10.1080/01621459.2011.646934.
Gaussian latent factor models are routinely used to model dependence in continuous, binary, and ordered categorical data. For unordered categorical variables, Gaussian latent factor models lead to challenging computation and complex modeling structures. As an alternative, we propose a novel class of simplex factor models. In the single-factor case, the model treats the different categorical outcomes as independent with unknown marginals. The model can characterize flexible dependence structures parsimoniously with few factors, and as factors are added, any multivariate categorical data distribution can be accurately approximated. Taking a Bayesian approach to computation and inference, we propose a Markov chain Monte Carlo (MCMC) algorithm that scales well with increasing dimension and treats the number of factors as unknown. We develop an efficient proposal for updating the base probability vector in hierarchical Dirichlet models. We describe theoretical properties and evaluate the approach through simulation examples. Applications include modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features.
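The generative idea can be illustrated with a minimal sketch. The abstract does not give the model equations, so the specific form used here is an assumption: each subject has a latent vector eta on the probability simplex, and the probability of each categorical outcome is the eta-weighted mixture of factor-specific probability vectors. The function name sample_simplex_factor, the symmetric Dirichlet prior on eta, and the parameter alpha are all hypothetical choices made for illustration. With a single factor, eta is identically 1 and the outcomes are independent with marginals given by the single row of each probability table, consistent with the single-factor statement above.

```python
import numpy as np

def sample_simplex_factor(n, lam, alpha, rng):
    """Draw n subjects from an illustrative simplex-factor-style model.

    lam   : list of arrays; lam[j] has shape (k, d_j), each row a probability
            vector over the d_j categories of variable j (hypothetical input).
    alpha : concentration of an assumed symmetric Dirichlet prior on eta.
    """
    k = lam[0].shape[0]
    if k == 1:
        # Single factor: eta is degenerate, outcomes are independent.
        eta = np.ones((n, 1))
    else:
        eta = rng.dirichlet(alpha * np.ones(k), size=n)  # shape (n, k)
    y = np.empty((n, len(lam)), dtype=int)
    for j, lam_j in enumerate(lam):
        probs = eta @ lam_j  # shape (n, d_j); each row sums to 1
        for i in range(n):
            y[i, j] = rng.choice(lam_j.shape[1], p=probs[i])
    return eta, y

# Example: 3 positions with 4 categories each (e.g. A, C, G, T), 2 factors.
rng = np.random.default_rng(0)
lam = [np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.1, 0.1, 0.7]])] * 3
eta, y = sample_simplex_factor(50, lam, alpha=1.0, rng=rng)
```

Because all positions share the same eta within a subject, their outcomes are dependent marginally even though they are conditionally independent given eta. This sketch covers simulation only; it does not implement the MCMC algorithm or the inference on the number of factors described above.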