Lanier Hayley C, Huang Huateng, Knowles L Lacey
(a)Department of Ecology and Evolutionary Biology, Museum of Zoology, University of Michigan, Ann Arbor, MI 48109-1079, USA; (b)Department of Zoology and Physiology, University of Wyoming-Casper, Casper, WY 82601, USA.
Mol Phylogenet Evol. 2014 Jan;70:112-9. doi: 10.1016/j.ympev.2013.09.006. Epub 2013 Sep 21.
Although species-tree methods have been widely adopted for multi-locus data, little consideration has been given to the source and character of the loci used in these approaches. Decisions about which loci to target in empirical studies are typically constrained by availability, technology and funds, characteristics that are not typically considered in simulation studies. As a result, most real-world datasets combine one or two variable loci (such as mtDNA or chloroplast loci) with multiple lower-variation loci to estimate species trees. These locus choices affect both the accuracy and the resolution of the resulting phylogeny. Furthermore, the fact that using a larger sample of loci can result in lower posterior probabilities has been used as an excuse to drop loci from an analysis. Here we address these issues directly through a simulation approach designed to mimic situations arising in empirical datasets by combining loci with differing mutation rates. We show that low-variation loci can be utilized in species-tree analyses that account for gene-tree uncertainty (e.g., a Bayesian framework), whereas maximum likelihood approaches show no improvement in accuracy when low-variation loci are added. We demonstrate that the limited phylogenetic signal associated with low-variation loci constrains gains in species-tree estimation accuracy when loci are added. Lastly, we demonstrate that the inclusion of only a handful of loci with higher mutation rates, and hence greater phylogenetic information content, can make a tremendous difference in the accuracy of species-tree estimates, suggesting that empiricists should consider the quality, and not just quantity, of loci in multi-locus phylogenetic analyses.