Stewart William C L, Peljto Anna L, Greenberg David A
Columbia University, Mailman School of Public Health, Division of Statistical Genetics, Department of Biostatistics, 722 W. 168th Street, 6th floor, New York, NY 10032, USA.
Hum Hered. 2010;69(3):152-9. doi: 10.1159/000267995. Epub 2009 Dec 18.
BACKGROUND/AIMS: Current linkage studies detect and localize trait loci using genotypes sampled at hundreds of thousands of single nucleotide polymorphisms (SNPs). Such data should provide precise estimates of trait location once linkage has been established. However, correlations between nearby SNPs can distort the information about trait location. Traditionally, analysts facing this problem have used one of three approaches: (1) ignore the correlation; (2) approximate the correlation; or (3) analyze a single, approximately uncorrelated subset of the original dense data.
METHODS: Here, we examine and test a simple and efficient estimator of trait location that averages location estimates across random subsamples of the original dense data. Based on pairwise estimates of correlation, we ensure that the SNPs within each subsample are approximately uncorrelated. In addition, we use the nonparametric bootstrap procedure to compute narrow, high-resolution candidate gene regions (i.e. confidence intervals for the true trait location).
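The estimator described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything named here is an assumption made for illustration: `estimate_location` is a hypothetical placeholder for whatever linkage analysis returns a location estimate from a set of families and a SNP subset, `r2` is a hypothetical function returning the estimated pairwise correlation between two SNPs, the threshold `max_r2=0.1` is arbitrary, and resampling whole families with a percentile interval is only one plausible reading of "nonparametric bootstrap".

```python
import random
from statistics import mean


def thin_snps(positions, r2, max_r2=0.1, rng=None):
    """Randomly pick SNP indices so that every kept pair has r2 <= max_r2.

    `r2(i, j)` is an assumed callable giving the estimated pairwise
    correlation between SNPs i and j; `max_r2` is an illustrative cutoff.
    """
    rng = rng or random.Random()
    order = list(range(len(positions)))
    rng.shuffle(order)  # a random visiting order gives a random subsample
    kept = []
    for i in order:
        if all(r2(i, j) <= max_r2 for j in kept):
            kept.append(i)
    return sorted(kept)


def averaged_location(families, positions, r2, estimate_location,
                      n_subsamples=100, seed=0):
    """Average location estimates over random, approximately uncorrelated subsamples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_subsamples):
        subset = thin_snps(positions, r2, rng=rng)
        # estimate_location is a hypothetical stand-in for the linkage analysis
        estimates.append(estimate_location(families, subset))
    return mean(estimates)


def bootstrap_interval(families, positions, r2, estimate_location,
                       n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling families with replacement (assumed scheme)."""
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        resampled = [rng.choice(families) for _ in families]
        boots.append(
            averaged_location(resampled, positions, r2, estimate_location,
                              n_subsamples=20, seed=rng.randrange(2**32))
        )
    boots.sort()
    lo = boots[int((1 - level) / 2 * (n_boot - 1))]
    hi = boots[int((1 + level) / 2 * (n_boot - 1))]
    return lo, hi
```

In this sketch the greedy thinning step guarantees only that no kept pair exceeds the chosen correlation cutoff. Averaging over many such subsamples is what lets the estimator draw on more of the dense data than a single thinned subset would.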
RESULTS: Using simulated data, we show that the three existing approaches to dense SNP linkage analysis (described above) can yield biased and/or inefficient estimation depending on the underlying correlation structure. With respect to mean squared error, our estimator outperforms the third approach, and is at least as good as, and usually better than, the first and second approaches. In an analysis of 15 hypertension families genotyped at approximately 500,000 SNPs, our estimator reduced the length of the candidate gene region by 47.5% relative to the third approach.
CONCLUSION: The method we developed will be an important tool for constructing high-resolution candidate gene regions that could ultimately aid in targeting regions for sequencing projects.