Using controls to limit false discovery in the era of big data.

Author information

Department of Physiology and Biophysics, Weill Cornell Medicine, 1300 York Ave, New York, NY, 10065, USA.

Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ, 08540, USA.

Publication information

BMC Bioinformatics. 2018 Sep 14;19(1):323. doi: 10.1186/s12859-018-2356-2.

Abstract

BACKGROUND

Procedures for controlling the false discovery rate (FDR) are widely applied as a solution to the multiple comparisons problem of high-dimensional statistics. Current FDR-controlling procedures require accurately calculated p-values and rely on extrapolation into the unknown and unobserved tails of the null distribution. Both of these intermediate steps are challenging and can compromise the reliability of the results.

RESULTS

We present a general method for controlling the FDR that capitalizes on the large amount of control data often found in big data studies to avoid these frequently problematic intermediate steps. The method utilizes control data to empirically construct the distribution of the test statistic under the null hypothesis and directly compares this distribution to the empirical distribution of the test data. By not relying on p-values, our control data-based empirical FDR procedure more closely follows the foundational principles of the scientific method: that inference is drawn by comparing test data to control data. The method is demonstrated through application to a problem in structural genomics.

CONCLUSIONS

The method described here provides a general statistical framework for controlling the FDR that is specifically tailored for the big data setting. By relying on empirically constructed distributions and control data, it forgoes potentially problematic modeling steps and extrapolation into the unknown tails of the null distribution. This procedure is broadly applicable insofar as controlled experiments or internal negative controls are available, as is increasingly common in the big data setting.
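The comparison described in the Results can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `empirical_fdr` and `smallest_threshold` are hypothetical, and the sketch assumes that larger test statistics are more extreme and that the fraction of true nulls among the tests is conservatively taken to be 1. Under those assumptions, the FDR at a threshold is estimated as the ratio of the control-data tail fraction to the test-data tail fraction, with no p-values and no parametric null model.

```python
import numpy as np


def empirical_fdr(test_stats, control_stats, threshold):
    """Estimate the FDR among test statistics >= threshold.

    Assumptions (illustrative, not taken from the paper): larger statistics
    are more extreme, and the proportion of true nulls is set to 1.
    """
    test_stats = np.asarray(test_stats, dtype=float)
    control_stats = np.asarray(control_stats, dtype=float)
    # Tail fraction of the null distribution, built directly from control data.
    null_tail = np.mean(control_stats >= threshold)
    # Tail fraction of the test data at the same threshold.
    test_tail = np.mean(test_stats >= threshold)
    if test_tail == 0:
        return 0.0  # nothing is called a discovery at this threshold
    return float(min(1.0, null_tail / test_tail))


def smallest_threshold(test_stats, control_stats, target_fdr):
    """Return the lowest observed test statistic whose estimated FDR is
    at or below target_fdr, or None if no threshold qualifies."""
    for t in np.unique(np.asarray(test_stats, dtype=float)):  # sorted ascending
        if empirical_fdr(test_stats, control_stats, t) <= target_fdr:
            return float(t)
    return None


# Synthetic usage: 10,000 control values, and a test set in which
# 10% of the values are shifted away from the null.
rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, 10_000)
test = np.concatenate([rng.normal(0.0, 1.0, 9_000), rng.normal(4.0, 1.0, 1_000)])
print(smallest_threshold(test, control, target_fdr=0.05))
```

Because the null tail is counted rather than modeled, the estimate is only as fine-grained as the control sample: with n control values, tail fractions below 1/n cannot be resolved, which is why the approach suits settings where control data are plentiful.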

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51d4/6137876/90466ac38682/12859_2018_2356_Fig1_HTML.jpg
