Dudbridge Frank, Gusnanto Arief, Koeleman Bobby P C
MRC Biostatistics Unit, Cambridge, UK.
Hum Genomics. 2006 Mar;2(5):310-7. doi: 10.1186/1479-7364-2-5-310.
Recent developments in the statistical analysis of genome-wide studies are reviewed. Genome-wide analyses are becoming increasingly common in areas such as scans for disease-associated markers and gene expression profiling. The data generated by these studies present new problems for statistical analysis, owing to the large number of hypothesis tests, comparatively small sample sizes and modest number of true gene effects. In this review, strategies are described for optimising the genotyping cost by discarding unpromising genes at an earlier stage, saving resources for the genes that show a trend of association. In addition, there is a review of new methods of analysis that combine evidence across genes to increase sensitivity to multiple true associations in the presence of many non-associated genes. Some methods achieve this by including only the most significant results, whereas others model the overall distribution of results as a mixture of distributions from true and null effects. Because genes are correlated even when they have no effect, permutation testing is often necessary to estimate the overall significance, but this can be very time consuming. Efficiency can be improved by fitting a parametric distribution to the permutation replicates, which can be re-used in subsequent analyses. Methods are also available to generate random draws from the permutation distribution. The review also includes discussion of new error measures that give a more reasonable interpretation of genome-wide studies, together with improved sensitivity. The false discovery rate allows a controlled proportion of positive results to be false, while detecting more true positives; and the local false discovery rate and false-positive report probability give clarity on whether or not a statistically significant test represents a real discovery.
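As a rough illustration of two of the error measures named above, the sketch below implements a step-up procedure of the Benjamini-Hochberg type for controlling the false discovery rate, and the usual closed form for the false-positive report probability. The abstract does not specify which procedures the review covers, so the choice of Benjamini-Hochberg, the function names `benjamini_hochberg` and `fprp`, the parameter names and the example p-values are all assumptions made for illustration. The step-up rule as written assumes independent (or positively dependent) tests.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Step-up FDR procedure (illustrative sketch, not taken from the review).

    Returns a list of booleans, True where the hypothesis is rejected
    while aiming to keep the false discovery rate at or below q.
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


def fprp(alpha, power, prior):
    """False-positive report probability (illustrative sketch).

    Probability that the null hypothesis is true given a result that is
    significant at level alpha, for a test with the given power and a
    prior probability `prior` that the association is real.
    """
    false_pos = alpha * (1.0 - prior)
    true_pos = power * prior
    return false_pos / (false_pos + true_pos)


# Hypothetical p-values for ten markers (made up for this example).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
fdr_hits = benjamini_hochberg(pvals, q=0.05)
bonferroni_hits = [p <= 0.05 / len(pvals) for p in pvals]
print("FDR rejections:", sum(fdr_hits))
print("Bonferroni rejections:", sum(bonferroni_hits))
print("FPRP at prior 0.001:", fprp(alpha=0.05, power=0.8, prior=0.001))
```

With these made-up inputs the step-up rule rejects more hypotheses than a Bonferroni threshold at the same nominal level, which is the gain in sensitivity the abstract refers to. The `fprp` call shows why a nominally significant test can still be unlikely to be a real discovery when the prior probability of association is low.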