Hernandez Ryan D, Williamson Scott H, Bustamante Carlos D
Biological Statistics and Computational Biology, Cornell University, NY, USA.
Mol Biol Evol. 2007 Aug;24(8):1792-800. doi: 10.1093/molbev/msm108. Epub 2007 Jun 1.
Population genetic analyses often use polymorphism data from one species and orthologous genomic sequences from closely related outgroup species. These outgroup sequences are frequently used to identify ancestral alleles at segregating sites and to compare the patterns of polymorphism and divergence. Inherent in such studies is the assumption of parsimony, which posits that the ancestral state of each single nucleotide polymorphism (SNP) is the allele that matches the orthologous site in the outgroup sequence, and that all nucleotide substitutions between species have been observed. This study tests the effect of violating the parsimony assumption when mutation rates vary across sites and over time. Using a context-dependent mutation model that accounts for elevated mutation rates at CpG dinucleotides, an increased propensity for transitional versus transversional mutations, and other directional and contextual mutation biases estimated along the human lineage, we show (using both simulations and a theoretical model) that enough unobserved substitutions could have occurred since the divergence of human and chimpanzee to cause many statistical tests to spuriously reject neutrality. Moreover, using both the chimpanzee and rhesus macaque genomes to parsimoniously identify ancestral states causes a large fraction of the data to be removed while not completely alleviating the problem. By constructing a novel model of the context-dependent mutation process, we can correct polymorphism data for the effect of ancestral misidentification using a single outgroup.
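To illustrate how ancestral misidentification can distort polymorphism data, the following is a minimal Python sketch, not the context-dependent model described in the abstract. It assumes a single, uniform misidentification probability `p` (a hypothetical parameter; in the study the rate depends on sequence context), a sample of `n` chromosomes, and the standard neutral expectation that the number of sites with `i` derived copies is proportional to `1/i`. Under these assumptions, a misidentified site with `i` derived copies is recorded as having `n - i`, so the observed spectrum is a mixture of the true spectrum and its mirror image, and the mixture can be inverted when `p` is known and differs from 0.5. The function names are illustrative only.

```python
def neutral_sfs(n, theta=1.0):
    """Expected unfolded site frequency spectrum under the standard
    neutral model: the class with i derived copies has weight theta / i,
    for i = 1 .. n-1."""
    return [theta / i for i in range(1, n)]


def misidentify(sfs, p):
    """Apply ancestral misidentification at a uniform rate p (an assumed,
    context-independent rate). A site with i derived copies is recorded
    as n - i with probability p, so each class is mixed with its mirror."""
    m = len(sfs)  # m = n - 1 frequency classes; index k holds i = k + 1
    return [(1 - p) * sfs[k] + p * sfs[m - 1 - k] for k in range(m)]


def correct(observed, p):
    """Invert the two-class mixture to recover the true spectrum.
    Requires p != 0.5, where the mixture is not invertible."""
    if abs(1 - 2 * p) < 1e-12:
        raise ValueError("p = 0.5 makes the correction undefined")
    m = len(observed)
    return [
        ((1 - p) * observed[k] - p * observed[m - 1 - k]) / (1 - 2 * p)
        for k in range(m)
    ]


if __name__ == "__main__":
    n, p = 10, 0.05  # hypothetical sample size and misidentification rate
    true_sfs = neutral_sfs(n)
    observed = misidentify(true_sfs, p)
    recovered = correct(observed, p)
    for i, (t, o, r) in enumerate(zip(true_sfs, observed, recovered), start=1):
        print(f"i={i:2d}  true={t:.4f}  observed={o:.4f}  corrected={r:.4f}")
```

In this sketch, even a small `p` inflates the high-frequency derived classes, because the abundant low-frequency sites are mirrored onto them. An excess of this kind is one way a neutral sample could come to look non-neutral in tests that rely on the unfolded spectrum.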