Göring H H, Terwilliger J D, Blangero J
Department of Genetics, Southwest Foundation for Biomedical Research, San Antonio, TX 78245-0549, USA.
Am J Hum Genet. 2001 Dec;69(6):1357-69. doi: 10.1086/324471. Epub 2001 Oct 9.
The primary goal of a genomewide scan is to estimate the genomic locations of genes influencing a trait of interest. It is sometimes said that a secondary goal is to estimate the phenotypic effects of each identified locus. Here, it is shown that these two objectives cannot be met reliably by use of a single data set of a currently realistic size. Simulation and analytical results, based on variance-components linkage analysis as an example, demonstrate that estimates of locus-specific effect size at genomewide LOD score peaks tend to be grossly inflated and can even be virtually independent of the true effect size, even for studies on large samples when the true effect size is small. However, the bias diminishes asymptotically. The explanation for the bias is that the LOD score is a function of the locus-specific effect-size estimate, such that there is a high correlation between the observed statistical significance and the effect-size estimate. When the LOD score is maximized over the many pointwise tests being conducted throughout the genome, the locus-specific effect-size estimate is therefore effectively maximized as well. We argue that attempts at bias correction give unsatisfactory results, and that pointwise estimation in an independent data set may be the only way of obtaining reliable estimates of locus-specific effect, and then only if one does not condition on statistical significance being obtained. We further show that the same factors causing this bias are responsible for frequent failures to replicate initial claims of linkage or association for complex traits, even when the initial localization is, in fact, correct. The findings of this study have wide-ranging implications, as they apply to all statistical methods of gene localization. It is hoped that, by keeping this bias in mind, we will more realistically interpret and extrapolate from the results of genomewide scans.
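The selection effect described above can be illustrated with a small simulation. The sketch below is a toy model, not the variance-components analysis used in the study: it assumes normally distributed estimation error, independent loci, estimates truncated at zero, and a test statistic that increases with the estimate, so that the genomewide peak is simply the largest estimate. The function names (`simulate_scan`, `run`) and all parameter values are hypothetical and chosen only for illustration.

```python
import random
import statistics


def simulate_scan(true_effect, n_loci, se, rng):
    """Simulate one toy genome scan.

    Returns (estimate at the true locus, estimate at the genomewide peak).
    """
    estimates = []
    for i in range(n_loci):
        # Assumption: locus 0 carries the true effect; all others are null.
        truth = true_effect if i == 0 else 0.0
        # Assumption: normal estimation error, truncated at zero.
        estimates.append(max(0.0, rng.gauss(truth, se)))
    # Assumption: the test statistic is monotone in the estimate,
    # so maximizing the statistic also maximizes the estimate.
    return estimates[0], max(estimates)


def run(true_effect=0.1, n_loci=500, se=0.1, n_reps=1000, seed=1):
    """Average the two estimates over many replicate scans."""
    rng = random.Random(seed)
    at_true, at_peak = [], []
    for _ in range(n_reps):
        t, p = simulate_scan(true_effect, n_loci, se, rng)
        at_true.append(t)
        at_peak.append(p)
    return statistics.mean(at_true), statistics.mean(at_peak)


if __name__ == "__main__":
    for effect in (0.0, 0.1):
        mean_true, mean_peak = run(true_effect=effect)
        print(f"true effect {effect}: "
              f"mean estimate at true locus {mean_true:.3f}, "
              f"mean estimate at peak {mean_peak:.3f}")
```

Under these assumptions, the average estimate at the peak is several times the true effect and changes little between a true effect of 0 and 0.1, while the estimate at the fixed true locus, taken without conditioning on significance, stays close to the truth.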