Correlated z-values and the accuracy of large-scale statistical estimates.

Author information

Bradley Efron

Affiliation

Department of Statistics, Stanford University.

Publication information

J Am Stat Assoc. 2010 Sep 1;105(491):1042-1055. doi: 10.1198/jasa.2010.tm09129.

DOI:10.1198/jasa.2010.tm09129

PMID:21052523

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC2967047/

Abstract

We consider large-scale studies in which there are hundreds or thousands of correlated cases to investigate, each represented by its own normal variate, typically a z-value. A familiar example is provided by a microarray experiment comparing healthy with sick subjects' expression levels for thousands of genes. This paper concerns the accuracy of summary statistics for the collection of normal variates, such as their empirical cdf or a false discovery rate statistic. It seems like we must estimate an N by N correlation matrix, N the number of cases, but our main result shows that this is not necessary: good accuracy approximations can be based on the root mean square correlation over all N · (N - 1)/2 pairs, a quantity often easily estimated. A second result shows that z-values closely follow normal distributions even under non-null conditions, supporting application of the main theorem. Practical application of the theory is illustrated for a large leukemia microarray study.
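The root mean square correlation mentioned in the abstract can be illustrated with a small sketch. It is not the paper's estimator: it simply takes the Pearson correlation of every pair of cases and returns the square root of the mean squared value. The function names `pearson` and `rms_correlation` and the toy data are assumptions made for illustration. With few samples per case this naive value is inflated by sampling noise, and the sketch makes no attempt to correct for that.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def rms_correlation(cases):
    """Naive root mean square correlation over all N * (N - 1) / 2 pairs.

    `cases` is a list of N sequences, one per case (for example, one per
    gene), each holding that case's measurements across the same samples.
    """
    squared = [pearson(u, v) ** 2 for u, v in combinations(cases, 2)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data (hypothetical): two identical cases and one uncorrelated case.
a = [1, -1, 1, -1]
b = [1, 1, -1, -1]
print(rms_correlation([a, a, b]))  # pairs have correlations 1, 0, 0
```

The loop visits all N · (N - 1)/2 pairs, which is adequate for a few thousand cases but grows quadratically; a vectorized correlation matrix would be the usual choice for larger N.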
