Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Montréal, Québec, Canada.
Mol Biol Evol. 2013 Jan;30(1):197-214. doi: 10.1093/molbev/mss208. Epub 2012 Aug 28.
Progress in sequencing technology allows researchers to assemble ever-larger supermatrices for phylogenomic inference. However, current phylogenomic studies often rest on patchy data sets, with some having 80% missing (or ambiguous) data or more. Though early simulations had suggested that missing data per se do not harm phylogenetic inference when using sufficiently large data sets, Lemmon et al. (Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst Biol. 58:130-145.) have recently cast doubt on this consensus in a study based on the introduction of parsimony-uninformative incomplete characters. In this work, we empirically reassess the issue of missing data in phylogenomics while exploring possible interactions with the model of sequence evolution. First, we note that parsimony-uninformative incomplete characters are actually informative in a probabilistic framework. A reanalysis of the data set of Lemmon et al. with this in mind gives a very different interpretation of their results and shows that some of their conclusions may be unfounded. Second, we investigate the effect of the progressive introduction of missing data in a complete supermatrix (126 genes × 39 species) capable of resolving animal relationships. These analyses demonstrate that missing data perturb phylogenetic inference slightly beyond the expected decrease in resolving power. In particular, they exacerbate systematic errors by reducing the number of species effectively available for the detection of multiple substitutions. Consequently, large sparse supermatrices are more sensitive to phylogenetic artifacts than smaller but less incomplete data sets, which argues for experimental designs aimed at collecting a modest number (~50) of highly covered genes.
Our results further confirm that including incomplete yet short-branch taxa (i.e., slowly evolving species or close outgroups) can help to eschew artifacts, as predicted by simulations. Finally, it appears that selecting an adequate model of sequence evolution (e.g., the site-heterogeneous CAT model instead of the site-homogeneous WAG model) is more beneficial to phylogenetic accuracy than reducing the level of missing data.
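The progressive introduction of missing data into a complete supermatrix can be illustrated with a minimal sketch. The abstract does not describe the masking scheme actually used, so everything below is an assumption made for illustration: the function names (`introduce_missing`, `missing_fraction`, `min_species_per_gene`) are hypothetical, each gene/species cell is treated as entirely present or entirely absent, and cells are removed uniformly at random. The per-gene species count is reported because the abstract attributes the added error to the reduced number of species effectively available for detecting multiple substitutions.

```python
import random


def introduce_missing(n_genes, n_species, fraction, seed=0):
    """Build a gene x species presence matrix (True = sequence present)
    in which about `fraction` of the cells are marked as missing.

    Assumption: cells are removed uniformly at random; the study's own
    removal scheme is not given in the abstract.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    cells = [(g, s) for g in range(n_genes) for s in range(n_species)]
    n_missing = round(fraction * len(cells))
    missing = set(rng.sample(cells, n_missing))
    return [
        [(g, s) not in missing for s in range(n_species)]
        for g in range(n_genes)
    ]


def missing_fraction(presence):
    """Fraction of gene/species cells that are absent."""
    total = sum(len(row) for row in presence)
    absent = sum(row.count(False) for row in presence)
    return absent / total


def min_species_per_gene(presence):
    """Smallest number of species sampled for any single gene."""
    return min(sum(row) for row in presence)


if __name__ == "__main__":
    # Dimensions taken from the abstract: 126 genes x 39 species.
    for level in (0.0, 0.2, 0.4, 0.6, 0.8):
        matrix = introduce_missing(126, 39, level, seed=1)
        print(level, round(missing_fraction(matrix), 3),
              min_species_per_gene(matrix))
```

In a real analysis, each masked cell would correspond to replacing that species' sequence for that gene with gap or ambiguity characters in the alignment before tree inference; the sketch only tracks which cells are occupied.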