Clustering and variable selection in the presence of mixed variable types and missing data.

Authors

Storlie C B, Myers S M, Katusic S K, Weaver A L, Voigt R G, Croarkin P E, Stoeckel R E, Port J D

Affiliations

Mayo Clinic, Rochester, USA.

Geisinger Autism & Developmental Medicine Institute, Lewisburg, USA.

Publication details

Stat Med. 2018 May 17. doi: 10.1002/sim.7697.

Abstract

We consider the problem of model-based clustering in the presence of many correlated, mixed continuous and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
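The two modeling ingredients named in the abstract, a Dirichlet process mixture and a latent continuous variable behind each discrete score, can be illustrated with a small generative sketch. This is not the authors' implementation: the truncation level, the concentration parameter `alpha`, the unit-variance Gaussian components, and the fixed `cutpoints` are all assumptions chosen for illustration, and the sketch performs no inference, variable selection, or missing-data handling.

```python
import random


def stick_breaking_weights(alpha, n_components, rng):
    """Truncated stick-breaking draw of Dirichlet process mixture weights."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(n_components - 1):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick to break off
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component absorbs the leftover, so weights sum to 1
    return weights


def discretize(latent, cutpoints):
    """Map a latent continuous value to an ordinal category using fixed cutpoints."""
    return sum(latent > c for c in cutpoints)


rng = random.Random(0)

# Hypothetical settings: truncation at 10 components, alpha = 1.
weights = stick_breaking_weights(alpha=1.0, n_components=10, rng=rng)
means = [rng.gauss(0.0, 3.0) for _ in weights]  # one latent mean per component
cutpoints = [-1.0, 0.0, 1.0]                    # 3 cutpoints give 4 ordinal levels

sample = []
for _ in range(5):
    k = rng.choices(range(len(weights)), weights=weights)[0]  # cluster label
    z = rng.gauss(means[k], 1.0)                              # latent continuous score
    sample.append((k, discretize(z, cutpoints)))              # observed discrete score
```

Each element of `sample` pairs a cluster label with the discrete score that would be observed. In a full model the labels and latent values would be inferred from the observed scores rather than simulated.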
