Department of Biological Physics, Eötvös University, Budapest, Hungary.
ELTE-MTA "Lendület" Evolutionary Genomics Research Group, Budapest, Hungary.
Syst Biol. 2023 Aug 7;72(4):767-780. doi: 10.1093/sysbio/syad013.
Accurate phylogenies are fundamental to our understanding of the pattern and process of evolution. Yet phylogenies at deep evolutionary timescales, with correspondingly long branches, have been fraught with controversy resulting from conflicting estimates from models of varying complexity and goodness of fit. Analyses of historical as well as current empirical datasets, such as alignments including Microsporidia, Nematoda, or Platyhelminthes, have demonstrated that inadequate modeling of across-site compositional heterogeneity can lead to erroneous topologies that are strongly supported. This heterogeneity is the result of biochemical constraints that lead to varying patterns of accepted amino acids along sequences. Unfortunately, models that adequately account for across-site compositional heterogeneity remain computationally challenging or intractable for an increasing fraction of contemporary datasets. Here, we introduce "compositional constraint analysis," a method to investigate how site-specific constraints on amino acid composition affect phylogenetic inference. We show that, under models that ignore across-site compositional heterogeneity, more constrained sites with lower diversity and less constrained sites with higher diversity exhibit ostensibly conflicting signals, which leads to long-branch attraction artifacts, and we demonstrate that more complex models accounting for across-site compositional heterogeneity can ameliorate this bias. We present CAT-posterior mean site frequencies (PMSF), a pipeline based on the CAT model for diagnosing and resolving phylogenetic bias resulting from inadequate modeling of across-site compositional heterogeneity. CAT-PMSF is robust against long-branch attraction in all alignments we have examined. We suggest using CAT-PMSF when convergence of the CAT model cannot be assured. We find evidence that compositionally constrained sites are driving long-branch attraction in two metazoan datasets and recover evidence for Porifera as the sister group to all other animals. [Animal phylogeny; cross-site heterogeneity; long-branch attraction; phylogenomics.]
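The idea of separating more constrained (lower diversity) sites from less constrained (higher diversity) sites can be illustrated with a minimal sketch. The abstract does not state how site diversity is measured, so the code below assumes one plausible choice, the effective number of amino acids per alignment column (the exponential of the Shannon entropy of observed residue frequencies). The function names `site_diversity` and `split_by_constraint`, the median default threshold, and the toy alignment in the usage note are all hypothetical and are not taken from the CAT-PMSF pipeline.

```python
import math
from collections import Counter

# The 20 standard amino acids; gaps and ambiguity codes are ignored.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def site_diversity(column):
    """Effective number of amino acids in one alignment column.

    Assumed measure (not from the source): exp(Shannon entropy) of the
    observed residue frequencies. Returns 0.0 for an all-gap column.
    """
    counts = Counter(c for c in column.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return math.exp(entropy)


def split_by_constraint(alignment, threshold=None):
    """Partition site indices into more and less constrained sets.

    alignment: dict mapping taxon name to an aligned sequence string.
    threshold: diversity cutoff; defaults to the median site diversity
               (an illustrative assumption).
    Returns (constrained_sites, relaxed_sites, diversities).
    """
    seqs = list(alignment.values())
    if not seqs:
        raise ValueError("alignment is empty")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must have equal length")

    diversities = [
        site_diversity("".join(s[i] for s in seqs)) for i in range(length)
    ]
    if threshold is None:
        threshold = sorted(diversities)[length // 2]

    constrained = [i for i, d in enumerate(diversities) if d <= threshold]
    relaxed = [i for i, d in enumerate(diversities) if d > threshold]
    return constrained, relaxed, diversities
```

For a toy alignment such as `{"a": "AAKL", "b": "AARV", "c": "AAKI", "d": "ACRM"}` with `threshold=1.8`, the first two columns fall into the constrained set and the last two into the relaxed set. Each subset could then be analyzed separately under a site-homogeneous and a site-heterogeneous model to check whether the two partitions support conflicting topologies.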