Asymptotic conditional singular value decomposition for high-dimensional genomic data.

Author information

Leek Jeffrey T

Affiliation

Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205-2179, USA.

Publication information

Biometrics. 2011 Jun;67(2):344-52. doi: 10.1111/j.1541-0420.2010.01455.x. Epub 2010 Jun 16.

DOI:10.1111/j.1541-0420.2010.01455.x

PMID:20560929

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3165001/

Abstract

High-dimensional data, such as those obtained from a gene expression microarray or second generation sequencing experiment, consist of a large number of dependent features measured on a small number of samples. One of the key problems in genomics is the identification and estimation of factors that associate with many features simultaneously. Identifying the number of factors is also important for unsupervised statistical analyses such as hierarchical clustering. A conditional factor model is the most common model for many types of genomic data, ranging from gene expression, to single nucleotide polymorphisms, to methylation. Here we show that under a conditional factor model for genomic data with a fixed sample size, the right singular vectors are asymptotically consistent for the unobserved latent factors as the number of features diverges. We also propose a consistent estimator of the dimension of the underlying conditional factor model for a finite fixed sample size and an infinite number of features based on a scaled eigen-decomposition. We propose a practical approach for selection of the number of factors in real data sets, and we illustrate the utility of these results for capturing batch and other unmodeled effects in a microarray experiment using the dependence kernel approach of Leek and Storey (2008, Proceedings of the National Academy of Sciences of the United States of America 105, 18718-18723).
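The consistency claim in the abstract can be illustrated with a small simulation. The sketch below is not the paper's procedure or its dimension estimator. It assumes a plain factor model X = B G + U with independent Gaussian loadings and homoscedastic Gaussian noise, and all function names (`simulate_factor_data`, `scaled_eigen`, `factor_alignment`) are hypothetical. It computes the eigen-decomposition of X^T X / m, whose eigenvectors are the right singular vectors of X, and measures how closely the leading r of them span the row space of the latent factors G.

```python
import numpy as np


def simulate_factor_data(m, n, r, noise_sd=1.0, seed=0):
    """Simulate X = B @ G + U with m features (rows) and n samples (columns).

    Assumption for illustration only: loadings B, factors G, and noise U
    are all independent standard Gaussian draws.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((r, n))             # latent factors, one row each
    B = rng.standard_normal((m, r))             # feature-level loadings
    U = noise_sd * rng.standard_normal((m, n))  # independent noise
    return B @ G + U, G


def scaled_eigen(X):
    """Eigen-decomposition of the n x n matrix X^T X / m, largest first.

    The eigenvectors are the right singular vectors of X.
    """
    m = X.shape[0]
    evals, evecs = np.linalg.eigh(X.T @ X / m)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]


def factor_alignment(X, G):
    """Smallest cosine of the principal angles between the span of the
    top-r right singular vectors of X and the row space of G.

    A value of 1.0 means the two subspaces coincide.
    """
    r = G.shape[0]
    _, evecs = scaled_eigen(X)
    V_r = evecs[:, :r]                  # leading r right singular vectors
    Q, _ = np.linalg.qr(G.T)            # orthonormal basis of row space of G
    cosines = np.linalg.svd(Q.T @ V_r, compute_uv=False)
    return float(cosines.min())


if __name__ == "__main__":
    # Fixed sample size n, growing number of features m.
    for m in (50, 500, 20000):
        X, G = simulate_factor_data(m=m, n=20, r=2, seed=1)
        print(m, factor_alignment(X, G))
```

Under these assumptions the alignment should approach 1 as m grows with n held fixed. The trailing eigenvalues returned by `scaled_eigen` settle near `noise_sd ** 2`, which gives some intuition for why a scaled eigen-decomposition can separate factor directions from noise.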

Similar articles

Identification of nutrient partitioning genes participating in rice grain filling by singular value decomposition (SVD) of genome expression data.

BMC Genomics. 2003 Jul 10;4(1):26. doi: 10.1186/1471-2164-4-26.

Kernel-imbedded Gaussian processes for disease classification using microarray gene expression data.

BMC Bioinformatics. 2007 Feb 28;8:67. doi: 10.1186/1471-2105-8-67.

Fractal dimension and wavelet decomposition for robust microarray data clustering.

Annu Int Conf IEEE Eng Med Biol Soc. 2008;2008:4106-9. doi: 10.1109/IEMBS.2008.4650112.

A process for analysis of microarray comparative genomics hybridisation studies for bacterial genomes.

BMC Genomics. 2008 Jan 29;9:53. doi: 10.1186/1471-2164-9-53.

The latent process decomposition of cDNA microarray data sets.

IEEE/ACM Trans Comput Biol Bioinform. 2005 Apr-Jun;2(2):143-56. doi: 10.1109/TCBB.2005.29.

Singular value decomposition regression models for classification of tumors from microarray experiments.

Pac Symp Biocomput. 2002:18-29.

Statistical aspects of omics data analysis using the random compound covariate.

BMC Syst Biol. 2012;6 Suppl 3(Suppl 3):S11. doi: 10.1186/1752-0509-6-S3-S11. Epub 2012 Dec 17.

Comparing large covariance matrices under weak conditions on the dependence structure and its application to gene clustering.

Biometrics. 2017 Mar;73(1):31-41. doi: 10.1111/biom.12552. Epub 2016 Jul 5.

Estimating genomic coexpression networks using first-order conditional independence.

Genome Biol. 2004;5(12):R100. doi: 10.1186/gb-2004-5-12-r100. Epub 2004 Nov 30.

Cited by

Complimentary vertebrate models exhibit phenotypes relevant to DeSanto-Shinawi Syndrome.

bioRxiv. 2024 Nov 24:2024.05.26.595966. doi: 10.1101/2024.05.26.595966.

Epigenome-wide association study of peripheral immune cell populations in Parkinson's disease.

NPJ Parkinsons Dis. 2023 Oct 31;9(1):149. doi: 10.1038/s41531-023-00594-x.

Plasma proteomics of SARS-CoV-2 infection and severity reveals impact on Alzheimer's and coronary disease pathways.

iScience. 2023 Apr 21;26(4):106408. doi: 10.1016/j.isci.2023.106408. Epub 2023 Mar 14.

The relationship between case-control differential gene expression from brain tissue and genetic associations in schizophrenia.

Am J Med Genet B Neuropsychiatr Genet. 2023 Jul-Sep;192(5-6):85-92. doi: 10.1002/ajmg.b.32931. Epub 2023 Jan 18.

Maternal Periconceptional Folic Acid Supplementation and DNA Methylation Patterns in Adolescent Offspring.

J Nutr. 2023 Jan 14;152(12):2669-2676. doi: 10.1093/jn/nxac184.

Plasma proteomics of SARS-CoV-2 infection and severity reveals impact on Alzheimer and coronary disease pathways.

medRxiv. 2022 Jul 25:2022.07.25.22278025. doi: 10.1101/2022.07.25.22278025.

Physical geography, isolation by distance and environmental variables shape genomic variation of wild barley (Hordeum vulgare L. ssp. spontaneum) in the Southern Levant.

Heredity (Edinb). 2022 Feb;128(2):107-119. doi: 10.1038/s41437-021-00494-x. Epub 2022 Jan 11.

Temporal Dynamic Methods for Bulk RNA-Seq Time Series Data.

Genes (Basel). 2021 Feb 27;12(3):352. doi: 10.3390/genes12030352.

Differential gene expression data from the human central nervous system across Alzheimer's disease, Lewy body diseases, and the amyotrophic lateral sclerosis and frontotemporal dementia spectrum.

Data Brief. 2021 Feb 11;35:106863. doi: 10.1016/j.dib.2021.106863. eCollection 2021 Apr.

Gene co-expression networks in peripheral blood capture dimensional measures of emotional and behavioral problems from the Child Behavior Checklist (CBCL).

Transl Psychiatry. 2020 Sep 23;10(1):328. doi: 10.1038/s41398-020-01007-w.

References

False discovery rate analysis of brain diffusion direction maps.

Ann Appl Stat. 2008 Mar;2(1):153-175. doi: 10.1214/07-AOAS133. Epub 2008 Mar 24.

Remarks on Parallel Analysis.

Multivariate Behav Res. 1992 Oct 1;27(4):509-40. doi: 10.1207/s15327906mbr2704_2.

A unified statistical approach for determining significant signals in images of cerebral activation.

Hum Brain Mapp. 1996;4(1):58-73. doi: 10.1002/(SICI)1097-0193(1996)4:1<58::AID-HBM4>3.0.CO;2-O.

A general framework for multiple testing dependence.

Proc Natl Acad Sci U S A. 2008 Dec 2;105(48):18718-23. doi: 10.1073/pnas.0808709105. Epub 2008 Nov 24.

Capturing heterogeneity in gene expression studies by surrogate variable analysis.

PLoS Genet. 2007 Sep;3(9):1724-35. doi: 10.1371/journal.pgen.0030161. Epub 2007 Aug 1.

Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels.

Science. 2007 Jun 1;316(5829):1331-6. doi: 10.1126/science.1142358. Epub 2007 Apr 26.

Quantitative high-throughput screening: a titration-based approach that efficiently identifies biological activities in large chemical libraries.

Proc Natl Acad Sci U S A. 2006 Aug 1;103(31):11473-8. doi: 10.1073/pnas.0604348103. Epub 2006 Jul 24.

Principal components analysis corrects for stratification in genome-wide association studies.

Nat Genet. 2006 Aug;38(8):904-9. doi: 10.1038/ng1847. Epub 2006 Jul 23.

Adjusting batch effects in microarray expression data using empirical Bayes methods.

Biostatistics. 2007 Jan;8(1):118-27. doi: 10.1093/biostatistics/kxj037. Epub 2006 Apr 21.

Mapping complex disease loci in whole-genome association studies.

Nature. 2004 May 27;429(6990):446-52. doi: 10.1038/nature02623.
