Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China.
Department of Statistics and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.
Syst Biol. 2022 Oct 12;71(6):1348-1361. doi: 10.1093/sysbio/syac040.
Whole-genome duplication (WGD) occurs broadly and repeatedly across the history of eukaryotes and is recognized as a prominent evolutionary force, especially in plants. Immediately following WGD, most genes are present in two copies as paralogs. Due to this redundancy, one copy of a paralog pair commonly undergoes pseudogenization and is eventually lost. When speciation occurs shortly after WGD, however, differential loss of paralogs may lead to spurious phylogenetic inference resulting from the inclusion of pseudoorthologs, that is, paralogous genes mistakenly identified as orthologs because they are present in single copies within each sampled species. The impact of including pseudoorthologs rather than true orthologs as a result of gene extinction (or incomplete laboratory sampling) has only recently begun to gain empirical attention in the phylogenomics community. Moreover, few studies have investigated this phenomenon in an explicit coalescent framework. Here, using mathematical models, numerous simulated data sets, and two newly assembled empirical data sets, we assess the effect of pseudoorthologs on species tree estimation under varying degrees of incomplete lineage sorting (ILS) and differential gene loss scenarios following WGD. When gene loss occurs along the terminal branches of the species tree, alignment-based (BPP) and gene-tree-based (ASTRAL, MP-EST, and STAR) coalescent methods are adversely affected as the degree of ILS increases. Their accuracy can be greatly improved by sampling a sufficiently large number of genes. Under the same circumstances, however, concatenation methods consistently estimate incorrect species trees as the number of genes increases. Additionally, pseudoorthologs can greatly mislead species tree inference when gene loss occurs along the internal branches of the species tree. In this case, both coalescent and concatenation methods yield inconsistent results. These results underscore the importance of understanding the influence of pseudoorthologs in the phylogenomics era. [Coalescent method; concatenation method; incomplete lineage sorting; pseudoorthologs; single-copy gene; whole-genome duplication.]
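To make the definition of pseudoorthologs concrete, the following Python sketch enumerates a toy case. It is only an illustration under stated assumptions and is not the model used in the study: three hypothetical species (A, B, C) descend from an ancestor that underwent WGD, each species retains exactly one of the two paralogs (labeled 1 and 2) after loss along its terminal branch, and ILS is ignored, so that species retaining the same paralog are taken to group together in the gene tree. The names `SPECIES`, `classify`, and `enumerate_patterns` are made up for this example.

```python
from collections import Counter
from itertools import product

# Hypothetical three-species example; labels are illustrative only.
SPECIES = ("A", "B", "C")


def classify(retained):
    """Classify one retention pattern.

    `retained` holds the paralog (1 or 2) kept by each species, in the
    order given by SPECIES. If all species keep the same paralog, the
    single-copy genes are true orthologs. Otherwise the gene family is a
    set of pseudoorthologs, and (ignoring ILS, an assumption of this toy
    model) the two species sharing a paralog group together.
    """
    if len(set(retained)) == 1:
        return "orthologs"
    for copy in (1, 2):
        same = [s for s, c in zip(SPECIES, retained) if c == copy]
        if len(same) == 2:
            return "pseudoorthologs:" + "".join(same)
    raise ValueError("unexpected retention pattern")


def enumerate_patterns():
    """Count outcomes over all 2**3 retention patterns."""
    return Counter(classify(p) for p in product((1, 2), repeat=len(SPECIES)))


if __name__ == "__main__":
    for outcome, count in sorted(enumerate_patterns().items()):
        print(outcome, count)
```

Every pattern here looks like a single-copy gene family in all three species, so copy number alone cannot separate the ortholog patterns from the pseudoortholog ones. How often each pattern arises in real data depends on loss rates and timing, which this sketch does not model.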