Storlie C B, Myers S M, Katusic S K, Weaver A L, Voigt R G, Croarkin P E, Stoeckel R E, Port J D
Mayo Clinic, Rochester, USA.
Geisinger Autism & Developmental Medicine Institute, Lewisburg, USA.
Stat Med. 2018 May 17. doi: 10.1002/sim.7697.
We consider the problem of model-based clustering in the presence of many correlated, mixed continuous and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential in determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. The data set contains a modest number of patients (486) and many test score variables (55), many of which are discrete valued and/or have missing values. The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. In simulations of problems of this type, the proposed approach compares very favorably with other methods. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
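The two modeling ingredients named in the abstract, a Dirichlet process mixture and a latent continuous variable behind each discrete score, can be illustrated with a minimal generative sketch. This is not the authors' model or sampler: the truncation level of 20 components, the concentration parameter `alpha`, the normal cluster means, and the cutpoints are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random


def stick_breaking_weights(alpha, num_components, rng):
    """Truncated stick-breaking draw of Dirichlet process mixture weights."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(num_components - 1):
        fraction = rng.betavariate(1.0, alpha)  # Beta(1, alpha) break point
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # last component takes the leftover mass
    return weights


def discretize(latent_value, cutpoints):
    """Map a latent continuous value to an ordinal category.

    The category is the number of cutpoints strictly below the latent value.
    """
    return sum(latent_value > c for c in cutpoints)


rng = random.Random(0)

# Assumed truncation at 20 components; most of the mass falls on a few of them.
weights = stick_breaking_weights(alpha=1.0, num_components=20, rng=rng)

# Assumed cluster-specific means for one latent test score.
cluster_means = [rng.gauss(0.0, 3.0) for _ in weights]

# Generate one patient: cluster label, latent score, observed ordinal score.
label = rng.choices(range(len(weights)), weights=weights)[0]
latent = rng.gauss(cluster_means[label], 1.0)
observed = discretize(latent, cutpoints=[-1.0, 0.0, 1.0])
```

Because the weights decay quickly, only a handful of components receive appreciable probability, which is how a mixture of this kind can leave the number of occupied clusters to be inferred from the data rather than fixed in advance.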