Big Data, Small Sample.

Author information

Gerlovina Inna, van der Laan Mark J, Hubbard Alan

Publication information

Int J Biostat. 2017 May 20;13(1):/j/ijb.2017.13.issue-1/ijb-2017-0012/ijb-2017-0012.xml. doi: 10.1515/ijb-2017-0012.

DOI:10.1515/ijb-2017-0012

Abstract

Multiple comparisons and small sample size, common characteristics of many types of "Big Data" including those produced by genomic studies, present specific challenges that affect the reliability of inference. Use of multiple testing procedures necessitates calculation of very small tail probabilities of a test statistic distribution. Results based on large deviation theory provide a formal condition that is necessary to guarantee error rate control given practical sample sizes, linking the number of tests and the sample size; this condition, however, is rarely satisfied. Using methods that are based on Edgeworth expansions (relying especially on the work of Peter Hall), we explore the impact of departures of sampling distributions from typical assumptions on actual error rates. Our investigation illustrates how far the actual error rates can be from the declared nominal levels, suggesting potentially widespread problems with error rate control, specifically excessive false positives. This is an important factor that contributes to the "reproducibility crisis". We also review some other commonly used methods (such as permutation and methods based on finite sampling inequalities) in their application to multiple-testing, small-sample data. We point out that Edgeworth expansions, providing higher order approximations to the sampling distribution, offer a promising direction for data analysis that could improve the reliability of studies relying on large numbers of comparisons with modest sample sizes.
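
To make the tail-probability point concrete, the following is a minimal sketch and not the authors' method. It assumes a standardized sample mean with known variance (rather than the t-statistic), a Bonferroni-style one-sided cutoff at alpha divided by the number of tests, and the textbook one-term Edgeworth correction, which can be inaccurate far in the tails. The function names and the example values (sample size 10, skewness 2) are hypothetical choices for illustration only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def bonferroni_cutoff(alpha, n_tests):
    """Upper-tail critical value chosen with the normal approximation."""
    # Use symmetry (-inv_cdf(p)) to avoid computing 1 - p for tiny p.
    return -STD_NORMAL.inv_cdf(alpha / n_tests)


def edgeworth_upper_tail(x, n, skewness):
    """One-term Edgeworth approximation to P(S_n > x).

    S_n is the standardized mean of n i.i.d. observations with the given
    skewness; the correction term is phi(x) * skewness * (x^2 - 1) / (6 sqrt(n)).
    """
    normal_tail = STD_NORMAL.cdf(-x)
    correction = STD_NORMAL.pdf(x) * skewness * (x * x - 1.0) / (6.0 * sqrt(n))
    return normal_tail + correction


def inflation(alpha, n_tests, n, skewness):
    """Ratio of the Edgeworth-approximated tail probability to the nominal one."""
    x = bonferroni_cutoff(alpha, n_tests)
    return edgeworth_upper_tail(x, n, skewness) / (alpha / n_tests)


if __name__ == "__main__":
    # Hypothetical setting: n = 10 observations, skewness 2, alpha = 0.05.
    for n_tests in (1, 100, 10_000):
        ratio = inflation(0.05, n_tests, n=10, skewness=2.0)
        print(f"tests={n_tests:>6}  approx. actual/nominal = {ratio:.2f}")
```

Under these assumptions the ratio grows as the number of tests increases, because the cutoff moves further into the tail where the skewness correction dominates the normal tail probability.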
