Clustering and variable selection in the presence of mixed variable types and missing data.

Authors

Storlie C B, Myers S M, Katusic S K, Weaver A L, Voigt R G, Croarkin P E, Stoeckel R E, Port J D

Affiliations

Mayo Clinic, Rochester, USA.

Geisinger Autism & Developmental Medicine Institute, Lewisburg, USA.

Publication information

Stat Med. 2018 May 17. doi: 10.1002/sim.7697.

Abstract

We consider the problem of model-based clustering in the presence of many correlated variables of mixed continuous and discrete type, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential in determining cluster membership. The work is motivated by the need to cluster patients suspected of having autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. The data set contains a modest number of patients (486) and many test score variables (55), many of which are discrete valued and/or have missing values. The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. In simulations of problems of this type, the proposed approach compares very favorably with other methods. The results of the autism spectrum disorder analysis suggested that 3 clusters were most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
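As a rough illustration of two ingredients named in the abstract, the sketch below draws mixture weights with a truncated stick-breaking construction of the Dirichlet process and generates a discrete variable by thresholding a latent continuous one. It is not the authors' model or code: the function names, the truncation level, the cutpoints, and the Gaussian components are all assumptions made for the example, and missing data, correlation between variables, and variable selection are left out.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking draw of Dirichlet process mixture weights."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick to break off
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component absorbs the leftover mass
    return weights


def discretize(latent, cutpoints):
    """Map a latent continuous value to an ordinal category via ordered cutpoints."""
    return sum(latent > c for c in cutpoints)


def simulate(n, alpha=1.0, truncation=20, seed=0):
    """Simulate n observations with one continuous and one ordinal variable."""
    rng = random.Random(seed)
    weights = stick_breaking_weights(alpha, truncation, rng)
    means = [rng.gauss(0.0, 3.0) for _ in range(truncation)]  # hypothetical cluster means
    cutpoints = [-1.0, 1.0]  # hypothetical thresholds giving 3 ordinal levels
    rows = []
    for _ in range(n):
        k = rng.choices(range(truncation), weights=weights)[0]  # cluster label
        continuous = rng.gauss(means[k], 1.0)  # observed continuous score
        latent = rng.gauss(means[k], 1.0)      # unobserved latent score
        rows.append((k, continuous, discretize(latent, cutpoints)))
    return weights, rows
```

Calling `simulate(50)` returns the sampled weights and 50 rows of `(cluster, continuous, ordinal)` values. The truncation only approximates the unbounded number of components that the Dirichlet process allows.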

Cited by

Determining County-Level Counterfactuals for Evaluation of Population Health Interventions: A Novel Application of K-Means Cluster Analysis.
Public Health Rep. 2022 Sep-Oct;137(5):849-859. doi: 10.1177/00333549211030507. Epub 2021 Jul 29.

Challenges of Modeling Outcomes for Surgical Infections: A Word of Caution.
Surg Infect (Larchmt). 2021 Jun;22(5):523-531. doi: 10.1089/sur.2020.208. Epub 2020 Oct 20.
