School of Mathematics and Statistics/Melbourne Integrative Genomics, The University of Melbourne, Melbourne 3010, Australia.
Alibaba Cloud, Hangzhou, China.
Syst Biol. 2024 Jul 27;73(2):355-374. doi: 10.1093/sysbio/syae007.
The evolution of gene families is complex, involving gene-level evolutionary events such as gene duplication, horizontal gene transfer, and gene loss, and other processes such as incomplete lineage sorting (ILS). Because of this, topological differences often exist between gene trees and species trees. A number of models have been recently developed to explain these discrepancies, the most realistic of which attempts to consider both gene-level events and ILS. When unified in a single model, the interaction between ILS and gene-level events can cause polymorphism in gene copy number, which we refer to as copy number hemiplasy (CNH). In this paper, we extend the Wright-Fisher process to include duplications and losses over several species, and show that the probability of CNH for this process can be significant. We study how well two unified models approximate the Wright-Fisher process with duplication and loss: the multilocus multispecies coalescent (MLMSC), which models CNH, and duplication, loss, and coalescence (DLCoal), which does not. We then study the effect of CNH on gene family evolution by comparing MLMSC and DLCoal. We generate comparable gene trees under both models, showing significant differences in various summary statistics; most importantly, CNH reduces the number of gene copies greatly. If this is not taken into account, the traditional method of estimating duplication rates (by counting the number of gene copies) becomes inaccurate. The simulated gene trees are also used for species tree inference with the summary methods ASTRAL and ASTRAL-Pro, demonstrating that their accuracy, based on CNH-unaware simulations calibrated on real data, may have been overestimated.
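To make the notion of copy number polymorphism concrete, the following is a minimal illustrative sketch, not the authors' extended Wright-Fisher process. It assumes a single neutral population of haploid individuals in which a new duplicate arises in exactly one individual, and it estimates by Monte Carlo how often that duplicate is still polymorphic (neither lost nor fixed) after a given number of generations. The function names, parameters, and the single-population setup are all assumptions made for illustration; losses, recurrent duplications, and multiple species are omitted.

```python
import random


def simulate_duplicate_count(n_haploid, generations, rng):
    """Neutral Wright-Fisher drift of a duplicate that starts in one individual.

    Hypothetical helper: returns how many of the n_haploid individuals
    carry the duplicate after the given number of generations.
    """
    count = 1  # assumption: the duplication arises in a single individual
    for _ in range(generations):
        if count == 0 or count == n_haploid:
            break  # loss and fixation are absorbing states
        freq = count / n_haploid
        # Each offspring independently inherits the duplicate with
        # probability equal to its current frequency (binomial sampling).
        count = sum(rng.random() < freq for _ in range(n_haploid))
    return count


def polymorphism_probability(n_haploid, generations, replicates, seed=0):
    """Monte Carlo estimate of P(duplicate is polymorphic after `generations`)."""
    rng = random.Random(seed)
    polymorphic = 0
    for _ in range(replicates):
        count = simulate_duplicate_count(n_haploid, generations, rng)
        if 0 < count < n_haploid:
            polymorphic += 1
    return polymorphic / replicates


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    for t in (1, 10, 50):
        print(t, polymorphism_probability(n_haploid=50, generations=t, replicates=2000))
```

In this simplified picture, a speciation event that occurs while the duplicate is still polymorphic allows the descendant species to end up with different copy numbers, which is the situation the abstract calls CNH.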