Mary K. Kuhner
Department of Genome Sciences, University of Washington, Seattle, WA, USA.
Mol Biol Evol. 2006 Dec;23(12):2355-60. doi: 10.1093/molbev/msl106. Epub 2006 Sep 6.
Data from HIV and from human neoplastic cells can show substantial between-lineage mutation rate variation even within a single population. Such variation may affect estimators of population quantities such as Theta = 4N_eμ, where N_e is the effective population size and μ is the mutation rate. Using simulated DNA data, I measured the effect of rate variation on recovery of Theta by the summary-statistic estimator of Watterson (Watterson GA. 1975. On the number of segregating sites in genetical systems without recombination. Theor Popul Biol. 7:256-276) and the coalescent maximum likelihood algorithm LAMARC (Kuhner MK. 2006. LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics. Advance Access doi: 10.1093/bioinformatics/btk051). Watterson's estimator showed a downward bias, as expected, with high values of Theta. LAMARC's mean estimate was accurate for all tested values of Theta and rate variation except for a downward bias when rate variation was maximal (i.e., the slow rate was zero). LAMARC had consistently narrower confidence intervals (CIs) than Watterson's estimator. Both methods tended to reject the truth too often when rate variation was 8-fold or greater and independent among branches, as well as when variation was 4-fold or greater and correlated among branches. In the case of Watterson's estimator, this excess rejection was fully attributable to variation among genealogies in the amount of total branch length associated with the fast and slow rates. However, in the case of LAMARC, some excess rejection was still observed even when between-genealogy variation was taken into account. Both estimators are robust to modest rate variation; however, their use should be coupled with a statistical test to rule out extreme rate variation, as the resulting CIs may not be reliable.
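For readers unfamiliar with the summary-statistic estimator named in the abstract, the following is a minimal sketch of Watterson's estimator in its standard textbook form: Theta_W = S / a_n, where S is the number of segregating sites among n aligned sequences and a_n = 1 + 1/2 + ... + 1/(n-1). The function names, the toy alignment, and the `per_site` option are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the paper's simulations or anything done by LAMARC.

```python
from typing import Sequence


def segregating_sites(alignment: Sequence[str]) -> int:
    """Count alignment columns in which at least two sequences differ."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    # zip(*alignment) iterates over columns; a column is segregating
    # when it contains more than one distinct character.
    return sum(1 for column in zip(*alignment) if len(set(column)) > 1)


def watterson_theta(alignment: Sequence[str], per_site: bool = False) -> float:
    """Watterson's estimate of Theta: S divided by the harmonic sum a_n.

    With per_site=True the estimate is additionally divided by the
    alignment length (hypothetical convenience option for this sketch).
    """
    n = len(alignment)
    if n < 2:
        raise ValueError("need at least two sequences")
    s = segregating_sites(alignment)
    a_n = sum(1.0 / i for i in range(1, n))  # 1 + 1/2 + ... + 1/(n-1)
    theta = s / a_n
    return theta / len(alignment[0]) if per_site else theta


if __name__ == "__main__":
    # Toy alignment (made up for illustration): 4 sequences, 8 sites.
    toy = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]
    print(watterson_theta(toy))
    print(watterson_theta(toy, per_site=True))
```

Because S counts each variable site only once, repeated mutations at the same site go uncounted. That is one plausible reading of the downward bias at high Theta that the abstract describes as expected, though the abstract itself does not spell out the mechanism.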