Li Gen, Jung Sungkyu
Department of Biostatistics, Mailman School of Public Health, Columbia University, New York 10032, New York, U.S.A.
Department of Statistics, University of Pittsburgh, Pittsburgh 15260, Pennsylvania, U.S.A.
Biometrics. 2017 Dec;73(4):1433-1442. doi: 10.1111/biom.12698. Epub 2017 Apr 13.
In modern biomedical research, multiple data sets are routinely measured on the same set of samples from different views (i.e., multi-view data). For example, in genetic studies, multiple genomic data sets at different molecular levels or from different cell types are measured for a common set of individuals to investigate genetic regulation. Integration and reduction of multi-view data have the potential to leverage information across data sets and to reduce the size and complexity of the data for further statistical analysis and interpretation. In this article, we develop a novel statistical model, called supervised integrated factor analysis (SIFA), for integrative dimension reduction of multi-view data while incorporating auxiliary covariates. The model decomposes data into joint and individual factors, capturing the joint variation across multiple data sets and the individual variation specific to each set, respectively. Moreover, both joint and individual factors are partially informed by auxiliary covariates via nonparametric models. We devise a computationally efficient Expectation-Maximization (EM) algorithm to fit the model under some identifiability conditions. We apply the method to the Genotype-Tissue Expression (GTEx) data and provide new insights into the variation decomposition of gene expression in multiple tissues. Extensive simulation studies and an additional application to a pediatric growth study demonstrate the advantage of the proposed method over competing methods.
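The joint-plus-individual decomposition described above can be illustrated with a small simulation. This is only a sketch of the data structure, not the SIFA estimation procedure or its EM algorithm: the function name `simulate_multiview`, all parameter names, the dimensions, and the use of `tanh` as a stand-in for the nonparametric covariate effect are assumptions made for illustration.

```python
import numpy as np

def simulate_multiview(n=200, p=(30, 40), r0=2, r=(1, 1), q=3,
                       noise_sd=0.1, seed=0):
    """Simulate multi-view data with joint and individual factors.

    Hypothetical illustration: each view is joint part + individual part
    + noise, and both kinds of factor scores depend partly on covariates Y.
    """
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(n, q))  # auxiliary covariates, shared by all views

    # Joint factor scores: a covariate-driven part plus a random part.
    # tanh is an assumed placeholder for an unspecified nonparametric effect.
    B0 = rng.normal(size=(q, r0))
    U = np.tanh(Y @ B0) + 0.5 * rng.normal(size=(n, r0))

    views, joint_parts, indiv_parts = [], [], []
    for p_k, r_k in zip(p, r):
        V0_k = rng.normal(size=(p_k, r0))   # joint loadings for this view
        V_k = rng.normal(size=(p_k, r_k))   # individual loadings for this view

        # Individual factor scores, also partially informed by covariates.
        B_k = rng.normal(size=(q, r_k))
        W_k = np.tanh(Y @ B_k) + 0.5 * rng.normal(size=(n, r_k))

        J_k = U @ V0_k.T                    # joint variation (shared scores U)
        A_k = W_k @ V_k.T                   # variation specific to this view
        E_k = noise_sd * rng.normal(size=(n, p_k))

        views.append(J_k + A_k + E_k)
        joint_parts.append(J_k)
        indiv_parts.append(A_k)

    return Y, views, joint_parts, indiv_parts

if __name__ == "__main__":
    Y, views, joint_parts, indiv_parts = simulate_multiview()
    for k, X_k in enumerate(views):
        print(f"view {k}: shape {X_k.shape}")
```

Because every view reuses the same score matrix `U`, the joint parts stacked across views have rank `r0`, while each individual part has its own scores and loadings. A fitting method would have to recover these pieces from `views` and `Y` alone.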