Leek Jeffrey T
Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205-2179, USA.
Biometrics. 2011 Jun;67(2):344-52. doi: 10.1111/j.1541-0420.2010.01455.x. Epub 2010 Jun 16.
High-dimensional data, such as those obtained from a gene expression microarray or second-generation sequencing experiment, consist of a large number of dependent features measured on a small number of samples. One of the key problems in genomics is the identification and estimation of factors that associate with many features simultaneously. Identifying the number of factors is also important for unsupervised statistical analyses such as hierarchical clustering. A conditional factor model is the most common model for many types of genomic data, ranging from gene expression, to single nucleotide polymorphisms, to methylation. Here we show that under a conditional factor model for genomic data with a fixed sample size, the right singular vectors are asymptotically consistent for the unobserved latent factors as the number of features diverges. We also propose a consistent estimator of the dimension of the underlying conditional factor model for a finite fixed sample size and an infinite number of features based on a scaled eigen-decomposition. We propose a practical approach for selection of the number of factors in real data sets, and we illustrate the utility of these results for capturing batch and other unmodeled effects in a microarray experiment using the dependence kernel approach of Leek and Storey (2008, Proceedings of the National Academy of Sciences of the United States of America 105, 18718-18723).
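The two results stated in the abstract can be illustrated with a small simulation. The sketch below is not the paper's procedure: it assumes a simplified factor model with no observed covariates, X = ΓG + U, with Gaussian loadings, factors, and noise. It then checks how closely the top right singular vectors of X span the row space of the latent factors G when the number of features m is large and the sample size n is fixed, and it counts the eigenvalues of the scaled matrix XᵀX/m that rise above the noise floor. All function names, the crude noise estimate, and the fixed `threshold` are hypothetical choices made for illustration; the paper's estimator uses its own scaling.

```python
import numpy as np


def simulate_factor_data(m, n, r, rng, noise_sd=1.0):
    """Simulate X = Gamma @ G + U with m features, n samples, r latent factors.

    Assumption: no observed covariates, and all terms are i.i.d. Gaussian.
    """
    G = rng.standard_normal((r, n))        # latent factors, r x n
    Gamma = rng.standard_normal((m, r))    # feature-specific loadings, m x r
    U = noise_sd * rng.standard_normal((m, n))  # independent noise, m x n
    return Gamma @ G + U, G


def top_right_singular_vectors(X, r):
    """Return the first r right singular vectors of X as an r x n array."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:r]


def row_space_agreement(V, G):
    """Smallest cosine of the principal angles between the row spaces of V and G.

    A value near 1 means the two subspaces nearly coincide.
    """
    Qv, _ = np.linalg.qr(V.T)
    Qg, _ = np.linalg.qr(G.T)
    cosines = np.linalg.svd(Qv.T @ Qg, compute_uv=False)
    return cosines.min()


def estimate_num_factors(X, threshold):
    """Count eigenvalues of X^T X / m that exceed a noise floor by `threshold`.

    Assumption: the smallest eigenvalue is used as a crude noise-level estimate,
    and `threshold` is a hypothetical tuning constant, not the paper's scaling.
    """
    m, _ = X.shape
    eigvals = np.linalg.eigvalsh(X.T @ X / m)[::-1]  # descending order
    noise_level = eigvals[-1]
    return int(np.sum(eigvals - noise_level > threshold))


rng = np.random.default_rng(0)
X, G = simulate_factor_data(m=20000, n=20, r=3, rng=rng)
V = top_right_singular_vectors(X, r=3)
agreement = row_space_agreement(V, G)
r_hat = estimate_num_factors(X, threshold=1.0)
print(f"row-space agreement: {agreement:.3f}, estimated factors: {r_hat}")
```

Rerunning the simulation with a much smaller `m` (for example 200) and the same `n` gives a sense of how the agreement depends on the number of features rather than on the number of samples.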