Asymptotic conditional singular value decomposition for high-dimensional genomic data.

Author information

Leek Jeffrey T

Affiliation

Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205-2179, USA.

Publication information

Biometrics. 2011 Jun;67(2):344-52. doi: 10.1111/j.1541-0420.2010.01455.x. Epub 2010 Jun 16.

Abstract

High-dimensional data, such as those obtained from a gene expression microarray or second generation sequencing experiment, consist of a large number of dependent features measured on a small number of samples. One of the key problems in genomics is the identification and estimation of factors that associate with many features simultaneously. Identifying the number of factors is also important for unsupervised statistical analyses such as hierarchical clustering. A conditional factor model is the most common model for many types of genomic data, ranging from gene expression, to single nucleotide polymorphisms, to methylation. Here we show that under a conditional factor model for genomic data with a fixed sample size, the right singular vectors are asymptotically consistent for the unobserved latent factors as the number of features diverges. We also propose a consistent estimator of the dimension of the underlying conditional factor model for a finite fixed sample size and an infinite number of features based on a scaled eigen-decomposition. We propose a practical approach for selection of the number of factors in real data sets, and we illustrate the utility of these results for capturing batch and other unmodeled effects in a microarray experiment using the dependence kernel approach of Leek and Storey (2008, Proceedings of the National Academy of Sciences of the United States of America 105, 18718-18723).
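As a rough illustration of the consistency claim in the abstract, the following Python sketch simulates a toy conditional factor model with a fixed number of samples, removes the known covariates, and measures how well the leading right singular vectors of the residuals recover the latent factor space as the number of features grows. The function name `subspace_alignment`, the Gaussian simulation settings, and the choice to treat the number of factors `r` as known are all assumptions made for illustration; this is not the paper's simulation design, and it does not implement the paper's estimator of the number of factors.

```python
import numpy as np


def subspace_alignment(m, n=20, r=2, noise_sd=1.0, seed=0):
    """Smallest principal-angle cosine between the top-r right singular
    vectors of the covariate-adjusted data and the true latent-factor space.

    Hypothetical toy model (an assumption, not the paper's simulation):
        X = B S + Gamma G + U, with X of shape (m features, n samples).
    """
    rng = np.random.default_rng(seed)

    # Known design: intercept plus a balanced two-group indicator (2 x n).
    S = np.vstack([np.ones(n), np.arange(n) % 2])
    # Unobserved latent factors (r x n), standing in for batch-like effects.
    G = rng.standard_normal((r, n))

    B = rng.standard_normal((m, S.shape[0]))    # coefficients on S
    Gamma = rng.standard_normal((m, r))         # loadings on G
    U = noise_sd * rng.standard_normal((m, n))  # independent noise
    X = B @ S + Gamma @ G + U

    # Condition on S: project every row of X onto the orthogonal
    # complement of the row space of S.
    P = np.eye(n) - S.T @ np.linalg.solve(S @ S.T, S)
    R = X @ P

    # Rows of Vt are right singular vectors, ordered by singular value.
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    V_hat = Vt[:r].T  # n x r estimate of the latent factor space

    # Orthonormal basis for the part of G not explained by S.
    Q, _ = np.linalg.qr((G @ P).T)  # n x r

    # Singular values of Q^T V_hat are the principal-angle cosines.
    cosines = np.linalg.svd(Q.T @ V_hat, compute_uv=False)
    return float(cosines.min())


if __name__ == "__main__":
    for m in (50, 500, 5000, 50000):
        print(m, round(subspace_alignment(m), 4))
```

With the sample size `n` held fixed, the printed alignment should move toward 1 as `m` increases, which is the behavior the abstract describes. The comparison is made between subspaces rather than individual vectors because singular vectors are only determined up to sign and rotation within the factor space.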

References cited in this article

1
False discovery rate analysis of brain diffusion direction maps.
Ann Appl Stat. 2008 Mar;2(1):153-175. doi: 10.1214/07-AOAS133. Epub 2008 Mar 24.
2
Remarks on Parallel Analysis.
Multivariate Behav Res. 1992 Oct 1;27(4):509-40. doi: 10.1207/s15327906mbr2704_2.
4
A general framework for multiple testing dependence.
Proc Natl Acad Sci U S A. 2008 Dec 2;105(48):18718-23. doi: 10.1073/pnas.0808709105. Epub 2008 Nov 24.
