Liu Binghui, Shen Xiaotong, Pan Wei
School of Mathematics and Statistics, Northeast Normal University, Changchun, 130024 Jilin Province, China.
School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA.
Stat Anal Data Min. 2016 Apr;9(2):106-116. doi: 10.1002/sam.11306. Epub 2016 Mar 28.
Integrative analysis has been used to identify clusters by integrating data of disparate types, such as deoxyribonucleic acid (DNA) copy number alterations and DNA methylation changes, in order to discover novel tumor subtypes. Most existing integrative analysis methods are based on joint latent variable models, which generally fall into two classes: joint factor analysis and joint mixture modeling, with continuous and discrete parameterizations of the latent variables, respectively. Despite recent progress, many issues remain. In particular, existing integration methods based on joint factor analysis may be inadequate for modeling multiple clusters because the assumed Gaussian distribution is unimodal, while those based on joint mixture modeling may lack the ability to perform dimension reduction and/or feature selection. In this paper, we employ a nonlinear joint latent variable model that allows flexible modeling: it can account for multiple clusters while also performing dimension reduction and feature selection. We propose a method, called integrative and regularized generative topographic mapping (irGTM), that performs dimension reduction simultaneously across multiple types of data while achieving feature selection separately for each data type. Simulations are performed to examine the operating characteristics of the methods; in these, the proposed method compares favorably with the popular iCluster method, which is based on a linear joint latent variable model. Finally, a glioblastoma multiforme (GBM) dataset is examined.
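For orientation, the standard generative topographic mapping (GTM), from which irGTM takes its name, can be written as a constrained Gaussian mixture. The block below is a sketch of that textbook single-data-type model only. The symbols (K, x_k, \phi, W, \beta, D) are notation assumed here for illustration; the integrative, regularized formulation of irGTM itself is not given in this abstract.

```latex
% Standard GTM for a single data type; notation assumed for illustration.
% x_k   : K fixed grid points in a low-dimensional latent space
% \phi  : vector of fixed nonlinear basis functions
% W     : weight matrix mapping basis outputs to data space
% \beta : inverse noise variance
% t     : observed data vector in R^D
\begin{align}
  y(x; W) &= W \phi(x), \\
  p(t \mid x_k, W, \beta) &= \left(\frac{\beta}{2\pi}\right)^{D/2}
      \exp\!\left(-\frac{\beta}{2}\,\lVert t - y(x_k; W)\rVert^{2}\right), \\
  p(t \mid W, \beta) &= \frac{1}{K} \sum_{k=1}^{K} p(t \mid x_k, W, \beta).
\end{align}
```

Because the K mixture centers are tied to a low-dimensional grid through the nonlinear map, a model of this form can be multimodal while still providing a low-dimensional representation. This is the combination that the abstract contrasts with joint factor analysis and joint mixture modeling.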