Liu Xiaoming, Maxwell Taylor J, Boerwinkle Eric, Fu Yun-Xin
Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, TX, USA.
Mol Biol Evol. 2009 Jul;26(7):1479-90. doi: 10.1093/molbev/msp059. Epub 2009 Mar 24.
One challenge of analyzing samples of DNA sequences is to account for the nonnegligible polymorphisms produced by error when the sequencing error rate is high or the sample size is large. Specifically, those artificial sequence variations will bias the observed single nucleotide polymorphism (SNP) frequency spectrum, which in turn may bias the estimators of the population mutation rate theta = 4N mu for diploids. In this paper, we propose a new approach based on the generalized least squares (GLS) method to estimate theta, given a SNP frequency spectrum in a random sample of DNA sequences from a population. With this approach, the error rate epsilon can be either known or unknown. In the latter case, epsilon can be estimated given an estimate of theta. Using coalescent simulation, we compared our estimators with other estimators of theta. The results showed that the GLS estimators are more efficient than other theta estimators with error, and that the estimation of epsilon is usable in practice when theta per bp is small. We demonstrate the application of the estimators with a 10-kb noncoding region sequence sampled from a human population and provide suggestions for choosing theta estimators with error.
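The bias described above can be illustrated with a minimal Python sketch. It is not the GLS estimator proposed in the paper. It uses two standard estimators of theta computed from the SNP frequency spectrum (Watterson's estimator and Tajima's pi) and a deliberately simple error model in which every sequencing error creates a spurious singleton, so about n * L * epsilon extra singletons are expected. That error model, the function names, and the parameter values (n = 50 sequences, L = 10,000 bp, theta = 10 per sequence, epsilon = 1e-4) are assumptions chosen for illustration, not values taken from the paper.

```python
from math import comb


def harmonic(n):
    """Return a_n = sum_{i=1}^{n-1} 1/i, the normalizer in Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n))


def watterson_theta(sfs):
    """Watterson's estimator: number of segregating sites divided by a_n.

    sfs[i - 1] holds the number of sites at which the derived allele
    appears in exactly i of the n sampled sequences, for i = 1..n-1.
    """
    n = len(sfs) + 1
    return sum(sfs) / harmonic(n)


def tajima_pi(sfs):
    """Tajima's pi: mean number of pairwise differences, computed from the spectrum."""
    n = len(sfs) + 1
    pairs = comb(n, 2)
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / pairs


def expected_sfs(theta, n):
    """Expected spectrum under the standard neutral model: E[xi_i] = theta / i."""
    return [theta / i for i in range(1, n)]


def add_error_singletons(sfs, n, length_bp, eps):
    """Assumed error model: each error yields one spurious singleton.

    The expected number of extra singletons is n * length_bp * eps.
    """
    noisy = list(sfs)
    noisy[0] += n * length_bp * eps
    return noisy


# Illustrative (hypothetical) parameter values.
n, length_bp, theta, eps = 50, 10_000, 10.0, 1e-4

clean = expected_sfs(theta, n)
noisy = add_error_singletons(clean, n, length_bp, eps)

print("Watterson, no error:  ", watterson_theta(clean))
print("Tajima pi, no error:  ", tajima_pi(clean))
print("Watterson, with error:", watterson_theta(noisy))
print("Tajima pi, with error:", tajima_pi(noisy))
```

Under this assumed model both estimators recover theta from the error-free expected spectrum, while the extra singletons inflate Watterson's estimator by n * L * epsilon / a_n and Tajima's pi by 2 * L * epsilon. The Watterson bias therefore grows with sample size, which is consistent with the abstract's remark that error-induced polymorphisms become nonnegligible when the error rate is high or the sample is large.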