Centre for Research in Environmental Epidemiology, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain.
Am J Epidemiol. 2013 Apr 1;177(7):718-25. doi: 10.1093/aje/kws289. Epub 2013 Feb 27.
Multiple imputation is a common technique for dealing with missing values and is mostly applied in regression settings. Its application in cluster analysis problems, where the main objective is to classify individuals into homogeneous groups, involves several difficulties that are not well characterized in the current literature. In this paper, we propose a framework for applying multiple imputation to cluster analysis when the original data contain missing values. The proposed framework incorporates the selection of the final number of clusters and a variable reduction procedure, which may be needed in data sets where the ratio of the number of persons to the number of variables is small. We suggest some ways to report how the uncertainty due to multiple imputation of missing data affects the cluster analysis outcomes, namely the final number of clusters, the results of a variable selection procedure (if applied), and the assignment of individuals to clusters. The proposed framework is illustrated with data from the Phenotype and Course of Chronic Obstructive Pulmonary Disease (PAC-COPD) Study (Spain, 2004-2008), which aimed to classify patients with chronic obstructive pulmonary disease into different disease subtypes.
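The general idea of clustering each imputed data set and then summarizing how the outcomes vary across imputations can be illustrated with a minimal sketch. This is not the procedure used in the paper: the imputation step (a random draw from the observed values of each variable), the clustering method (a basic k-means), the criterion for choosing the number of clusters (mean silhouette width), and all function names and toy data are assumptions made purely for illustration. Variable selection is omitted.

```python
import random
from collections import Counter

def impute_once(rows, rng):
    """Fill each missing value (None) with a random draw from the observed
    values of the same variable. A crude stand-in for a proper imputation model."""
    observed = [[v for v in col if v is not None] for col in zip(*rows)]
    return [[v if v is not None else rng.choice(observed[j])
             for j, v in enumerate(row)] for row in rows]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, rng, iters=50):
    """Basic Lloyd's algorithm with random initial centers; returns labels."""
    centers = rng.sample(data, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c]))
                      for p in data]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def mean_silhouette(data, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    n, total = len(data), 0.0
    for i in range(n):
        dists = {}
        for j in range(n):
            if j != i:
                dists.setdefault(labels[j], []).append(sqdist(data[i], data[j]) ** 0.5)
        own = dists.pop(labels[i], None)
        if own is None or not dists:
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in dists.values())
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

def cluster_imputations(rows, m=20, k_range=(2, 3, 4), seed=0):
    """Cluster m imputed data sets; report how often each number of clusters
    is selected and how often each pair of individuals is clustered together."""
    rng = random.Random(seed)
    n = len(rows)
    chosen_k = Counter()
    together = [[0] * n for _ in range(n)]
    for _ in range(m):
        data = impute_once(rows, rng)
        best = max((kmeans(data, k, rng) for k in k_range),
                   key=lambda lab: mean_silhouette(data, lab))
        chosen_k[len(set(best))] += 1  # number of non-empty clusters
        for i in range(n):
            for j in range(n):
                together[i][j] += best[i] == best[j]
    co_assign = [[count / m for count in row] for row in together]
    return chosen_k, co_assign

# Toy data (hypothetical): two groups of four individuals, some values missing.
rows = [[1.0, 1.1], [0.9, None], [1.2, 0.8], [None, 1.0],
        [5.0, 5.2], [5.1, None], [4.8, 5.0], [None, 4.9]]
chosen_k, co_assign = cluster_imputations(rows, m=10)
print(dict(chosen_k))   # how many imputations selected each number of clusters
print(co_assign[0][1])  # proportion of imputations placing persons 0 and 1 together
```

The sketch reports pairwise co-assignment proportions rather than raw cluster labels because labels are arbitrary and can switch between imputed data sets; proportions far from 0 and 1 flag individuals whose assignment is sensitive to the imputation.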