Reduction of selection bias in genomewide studies by resampling.

Author information

Sun Lei, Bull Shelley B

Affiliation

Department of Public Health Sciences, University of Toronto, Toronto, Canada.

Publication information

Genet Epidemiol. 2005 May;28(4):352-67. doi: 10.1002/gepi.20068.

DOI:10.1002/gepi.20068

PMID:15761913

Abstract

The accuracy of gene localization, the reliability of locus-specific effect estimates, and the ability to replicate initial claims of linkage and/or association have emerged as major methodological concerns in genomewide studies of complex diseases and quantitative traits. To address the issue of multiple comparisons inherent in genomewide studies, the use of stringent criteria for assessing statistical significance has been generally acknowledged as a strategy to control type I error. However, the application of genomewide significance criteria does not take account of the selection bias introduced into parameter estimates, e.g., estimates of locus-specific effect size of disease/trait loci. Some have argued that reliable locus-specific parameter estimates can only be obtained in an independent sample. In this report, we examine statistical resampling techniques, including cross-validation and the bootstrap, applied to the initial sample to improve the estimation of locus-specific effects. We compare them with the naive method in which all data are used for both hypothesis testing and parameter estimation, as well as with the split-sample approach in which part of the data are reserved for estimation. Upward bias of the naive estimator and inadequacy of the split-sample approach are derived analytically under a simple quantitative trait model. Simulation studies of the resampling methods are performed for both the simple model and a more realistic genomewide linkage analysis. Our results suggest that cross-validation and bootstrap methods can substantially reduce the estimation bias, especially when the effect size is small or there is no genetic effect.

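The selection bias described in the abstract can be illustrated with a small simulation. The sketch below is a toy model, not the authors' procedure: it assumes independent loci, normally distributed observations with unit variance, and selection of the single locus with the largest sample mean instead of a genomewide significance threshold. The function name `simulate` and all of its parameters are hypothetical. It compares the naive estimator (select and estimate on the same data) with a split-sample estimator (select on one half, estimate on the other).

```python
import random
import statistics


def simulate(true_effect=0.0, n=100, n_loci=50, n_reps=200, seed=0):
    """Return (mean naive estimate, mean split-sample estimate).

    Toy assumptions (not from the paper): every locus has the same
    true effect, observations are N(true_effect, 1), and the "top"
    locus is simply the one with the largest sample mean.
    """
    rng = random.Random(seed)
    half = n // 2
    naive_estimates = []
    split_estimates = []
    for _ in range(n_reps):
        # One row of n observations per locus.
        data = [
            [rng.gauss(true_effect, 1.0) for _ in range(n)]
            for _ in range(n_loci)
        ]

        # Naive: the same data select the locus and estimate its effect.
        full_means = [statistics.fmean(row) for row in data]
        naive_estimates.append(max(full_means))

        # Split-sample: first half selects, second half estimates.
        first_half_means = [statistics.fmean(row[:half]) for row in data]
        best = max(range(n_loci), key=first_half_means.__getitem__)
        split_estimates.append(statistics.fmean(data[best][half:]))

    return statistics.fmean(naive_estimates), statistics.fmean(split_estimates)


if __name__ == "__main__":
    naive, split = simulate(true_effect=0.0)
    print(f"true effect: 0.0  naive: {naive:.3f}  split-sample: {split:.3f}")
```

With no true effect, the naive estimate averages well above zero because it reports the maximum of many noisy means, while the split-sample estimate averages close to zero. The split-sample estimate uses only half of the data for estimation, so each individual estimate is noisier, and only half of the data is available for detecting the locus in the first place.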