Russo Massimiliano, Singer Burton H, Dunson David B
Harvard Medical School, and Dana-Farber Cancer Institute.
University of Florida.
Ann Appl Stat. 2022 Mar;16(1):391-413. doi: 10.1214/21-aoas1496. Epub 2022 Mar 28.
Characterizing the shared memberships of individuals in a classification scheme poses severe interpretability issues, even with a moderate number of classes (say, four). Mixed membership models quantify this phenomenon, but they typically emphasize goodness of fit over interpretable inference. To achieve a good numerical fit, these models may in fact require many extreme profiles, which makes the results difficult to interpret. We introduce a new class of multivariate mixed membership models that, when variables can be partitioned into subject-matter-based domains, can provide a good fit to the data using fewer profiles than standard formulations. The proposed model explicitly accounts for the blocks of variables corresponding to the distinct domains, along with a cross-domain correlation structure, which provides new information about the shared membership of individuals in a complex classification scheme. We specify a multivariate logistic normal distribution for the membership vectors, which allows auxiliary information to be introduced easily through a latent multivariate logistic regression. A Bayesian approach to inference, relying on Pólya-Gamma data augmentation, facilitates efficient posterior computation via Markov chain Monte Carlo. We apply this methodology to a spatially explicit study of malaria risk over time on the Brazilian Amazon frontier.
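As a rough illustration of the logistic normal construction mentioned above, the following Python sketch draws membership vectors whose latent Gaussian scores have a regression mean and a shared covariance. It is a generic, single-domain simplification and not the authors' implementation: the function names, the coefficient matrix `B`, the covariance `Sigma`, the covariate values, and the choice of fixing the last profile's score at zero are all assumptions made for the example.

```python
import math
import random


def cholesky(S):
    """Lower-triangular L with L L^T = S (S symmetric positive definite)."""
    k = len(S)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def sample_membership(x, B, L, rng):
    """Draw one membership vector over H profiles for covariates x.

    B[h] holds hypothetical regression coefficients for latent score h
    (h = 0..H-2); L is the Cholesky factor of the score covariance.
    """
    k = len(B)
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]
    eta = [
        sum(b * xj for b, xj in zip(B[h], x))         # regression mean
        + sum(L[h][m] * z[m] for m in range(h + 1))   # correlated Gaussian noise
        for h in range(k)
    ]
    scores = eta + [0.0]  # assumption: last profile is the reference, score 0
    top = max(scores)     # subtract the max for numerical stability
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    return [v / total for v in w]


rng = random.Random(0)
H = 4  # number of extreme profiles (illustrative)
# Illustrative covariance: unit variances, correlation 0.5 between scores.
Sigma = [[1.0 if i == j else 0.5 for j in range(H - 1)] for i in range(H - 1)]
L = cholesky(Sigma)
B = [[0.3, -1.0], [0.0, 0.5], [-0.2, 0.8]]           # intercept and one covariate
covariates = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]   # made-up individuals
memberships = [sample_membership(x, B, L, rng) for x in covariates]
```

Each entry of `memberships` is a point on the simplex: nonnegative weights over the `H` profiles that sum to one. Unlike a Dirichlet prior, the Gaussian covariance lets the weights be correlated in flexible ways, and the regression mean is where covariates (auxiliary information) enter.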