Big Data, Small Sample.

Authors

Gerlovina Inna, van der Laan Mark J, Hubbard Alan

Publication information

Int J Biostat. 2017 May 20;13(1):/j/ijb.2017.13.issue-1/ijb-2017-0012/ijb-2017-0012.xml. doi: 10.1515/ijb-2017-0012.

DOI:10.1515/ijb-2017-0012
PMID:28599385
Abstract

Multiple comparisons and small sample size, common characteristics of many types of "Big Data" including those that are produced by genomic studies, present specific challenges that affect the reliability of inference. Use of multiple testing procedures necessitates calculation of very small tail probabilities of a test statistic distribution. Results based on large deviation theory provide a formal condition that is necessary to guarantee error rate control given practical sample sizes, linking the number of tests and the sample size; this condition, however, is rarely satisfied. Using methods that are based on Edgeworth expansions (relying especially on the work of Peter Hall), we explore the impact of departures of sampling distributions from typical assumptions on actual error rates. Our investigation illustrates how far the actual error rates can be from the declared nominal levels, suggesting potentially widespread problems with error rate control, specifically excessive false positives. This is an important factor that contributes to the "reproducibility crisis". We also review some other commonly used methods (such as permutation and methods based on finite sampling inequalities) in their application to multiple testing/small sample data. We point out that Edgeworth expansions, providing higher order approximations to the sampling distribution, offer a promising direction for data analysis that could improve the reliability of studies relying on large numbers of comparisons with modest sample sizes.
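
As a rough illustration of the gap between nominal and actual error rates that the abstract describes, the sketch below uses a toy setting that is an assumption of this example, not the authors' analysis: a one-sided z-test with known variance on n i.i.d. Exp(1) observations (skewness 2), tested at a Bonferroni per-test level alpha/m. The sum of n Exp(1) draws is Gamma(n, 1), so the true rejection probability has a closed form that can be compared with the nominal level and with a one-term Edgeworth approximation for the standardized mean. The function names and the choice n = 10, alpha = 0.05, m = 1000 are hypothetical.

```python
import math
from statistics import NormalDist


def exact_upper_tail_exp_mean(n: int, z: float) -> float:
    """Exact P(sqrt(n) * (mean - 1) > z) for n i.i.d. Exp(1) draws.

    The sum of the draws is Gamma(n, 1), and for integer n
    P(sum > y) = sum_{k=0}^{n-1} exp(-y) * y**k / k!.
    Intended for small n; exp(-y) underflows for very large n.
    """
    y = n + z * math.sqrt(n)
    term = math.exp(-y)  # k = 0 term
    total = term
    for k in range(1, n):
        term *= y / k  # builds exp(-y) * y**k / k! incrementally
        total += term
    return total


def edgeworth_upper_tail(n: int, z: float, skew: float) -> float:
    """One-term Edgeworth approximation to the upper tail of the
    standardized mean: 1 - Phi(z) + skew * (z^2 - 1) * phi(z) / (6 sqrt(n)).
    """
    std = NormalDist()
    correction = skew * (z * z - 1.0) * std.pdf(z) / (6.0 * math.sqrt(n))
    return (1.0 - std.cdf(z)) + correction


def compare(n: int, alpha: float, m: int):
    """Return (nominal, Edgeworth, exact) per-test rejection probabilities."""
    per_test = alpha / m  # Bonferroni per-test level
    z = NormalDist().inv_cdf(1.0 - per_test)  # normal-theory critical value
    skew_exp = 2.0  # skewness of the Exp(1) distribution
    return (
        per_test,
        edgeworth_upper_tail(n, z, skew_exp),
        exact_upper_tail_exp_mean(n, z),
    )


if __name__ == "__main__":
    nominal, approx, exact = compare(n=10, alpha=0.05, m=1000)
    print(f"nominal per-test level : {nominal:.3e}")
    print(f"one-term Edgeworth     : {approx:.3e}")
    print(f"exact (Gamma tail)     : {exact:.3e}")
    print(f"exact / nominal        : {exact / nominal:.1f}")
```

In this toy case the exact tail probability is well above the nominal level, and the one-term Edgeworth value moves in the same direction without fully closing the gap this far into the tail. A studentized statistic or a different data distribution would need a different expansion and would give different numbers.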

Similar articles

1
Big Data, Small Sample.
Int J Biostat. 2017 May 20;13(1):/j/ijb.2017.13.issue-1/ijb-2017-0012/ijb-2017-0012.xml. doi: 10.1515/ijb-2017-0012.
2
A comparison of exact tests for trend with binary endpoints using Bartholomew's statistic.
Int J Biostat. 2014;10(2):221-30. doi: 10.1515/ijb-2014-0013.
3
Quantile-function based null distribution in resampling based multiple testing.
Stat Appl Genet Mol Biol. 2006;5:Article14. doi: 10.2202/1544-6115.1199. Epub 2006 May 21.
4
The significance of non-significance.
QJM. 1998 Sep;91(9):647-53. doi: 10.1093/qjmed/91.9.647.
5
Significance levels for studies with correlated test statistics.
Biostatistics. 2008 Jul;9(3):458-66. doi: 10.1093/biostatistics/kxm047. Epub 2007 Dec 18.
6
To permute or not to permute.
Bioinformatics. 2006 Sep 15;22(18):2244-8. doi: 10.1093/bioinformatics/btl383. Epub 2006 Jul 26.
7
A method to increase the power of multiple testing procedures through sample splitting.
Stat Appl Genet Mol Biol. 2006;5:Article19. doi: 10.2202/1544-6115.1148. Epub 2006 Aug 1.
8
Bootstrap and second-order tests of risk difference.
Biometrics. 2010 Sep;66(3):975-82. doi: 10.1111/j.1541-0420.2009.01354.x.
9
An introduction to statistical inference--3.
J Accid Emerg Med. 2000 Sep;17(5):357-63. doi: 10.1136/emj.17.5.357.
10
Standard error and sample size determination for estimation of probabilities based on a test variable.
J Clin Epidemiol. 1996 Apr;49(4):419-29. doi: 10.1016/0895-4356(95)00570-6.

Cited by

1
Deconstructing Intratumoral Heterogeneity through Multiomic and Multiscale Analysis of Serial Sections.
Cancers (Basel). 2024 Jul 1;16(13):2429. doi: 10.3390/cancers16132429.
2
Deconstructing intratumoral heterogeneity through multiomic and multiscale analysis of serial sections.
bioRxiv. 2024 Mar 18:2023.06.21.545365. doi: 10.1101/2023.06.21.545365.
3
A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology.
Stat Methods Med Res. 2023 Mar;32(3):539-554. doi: 10.1177/09622802221146313. Epub 2022 Dec 26.
4
False and true positives in arthropod thermal adaptation candidate gene lists.
Genetica. 2021 Jun;149(3):143-153. doi: 10.1007/s10709-021-00122-w. Epub 2021 May 7.