Clustering and variable selection in the presence of mixed variable types and missing data.

Authors

Storlie C B, Myers S M, Katusic S K, Weaver A L, Voigt R G, Croarkin P E, Stoeckel R E, Port J D

Affiliations

Mayo Clinic, Rochester, USA.

Geisinger Autism & Developmental Medicine Institute, Lewisburg, USA.

Publication information

Stat Med. 2018 May 17. doi: 10.1002/sim.7697.

DOI:10.1002/sim.7697

PMID:29774571

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6240391/

Abstract

We consider the problem of model-based clustering in the presence of many correlated, mixed continuous and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential in determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. The data set contains a modest number of patients (486) and many (55) test score variables, many of which are discrete valued and/or missing. The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. In simulations of problems of this type, the proposed approach compares very favorably with other methods. The autism spectrum disorder analysis suggested that 3 clusters were most likely, while only 4 test scores had a high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
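The two modeling ingredients named in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' model or code: it draws mixture weights from a truncated stick-breaking representation of a Dirichlet process, generates latent continuous scores, and thresholds one of them into an ordinal variable with some values set to missing. The function names, the truncation level of 20, the concentration value, the cutpoints, and the 20% missingness rate are all hypothetical choices made for illustration; only the sample size of 486 comes from the abstract.

```python
import numpy as np


def stick_breaking_weights(alpha, n_components, rng):
    """Truncated stick-breaking weights for a Dirichlet process (concentration alpha)."""
    betas = rng.beta(1.0, alpha, size=n_components)
    betas[-1] = 1.0  # truncation: the last piece takes whatever stick is left
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining


def discretize(latent, cutpoints):
    """Map latent continuous values to ordinal categories 0..len(cutpoints)."""
    return np.searchsorted(cutpoints, latent)


rng = np.random.default_rng(0)
n_patients = 486  # sample size reported in the abstract

# Mixture weights; most of the mass falls on a few components.
weights = stick_breaking_weights(alpha=1.0, n_components=20, rng=rng)
labels = rng.choice(len(weights), size=n_patients, p=weights)

# Each component has its own mean for two latent continuous variables.
means = rng.normal(0.0, 2.0, size=(len(weights), 2))
latent = means[labels] + rng.normal(size=(n_patients, 2))

# First variable is observed as continuous; second is seen only as an ordinal score.
continuous_score = latent[:, 0]
ordinal_score = discretize(latent[:, 1], cutpoints=np.array([-1.0, 0.0, 1.0]))

# Hypothetical missingness: hide about 20% of the ordinal scores.
missing = rng.random(n_patients) < 0.2
ordinal_observed = np.where(missing, np.nan, ordinal_score)

print("occupied components:", np.unique(labels).size)
print("ordinal categories seen:", np.unique(ordinal_score))
```

In a model of this kind, inference would run in the opposite direction: given the observed continuous and ordinal scores, one would sample the latent values, the cluster labels, and the indicators of which variables are informative. That step is outside the scope of this sketch.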
