Frankel Lauren E, Ané Cécile
Department of Botany, University of Wisconsin-Madison, Madison, WI 53706, USA.
Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA.
Syst Biol. 2023 Dec 30;72(6):1357-1369. doi: 10.1093/sysbio/syad056.
The evolutionary implications and frequency of hybridization and introgression are increasingly being recognized across the tree of life. To detect hybridization from multi-locus and genome-wide sequence data, a popular class of methods is based on summary statistics from subsets of 3 or 4 taxa. However, these methods often assume a constant substitution rate across lineages and genes, an assumption that is violated in many groups. In this work, we quantify the effects of rate variation on the D test (also known as the ABBA-BABA test), the D3 test, and HyDe. All 3 tests are used widely across a range of taxonomic groups, in part because they are very fast to compute. We consider rate variation across species lineages, across genes, their lineage-by-gene interaction, and rate variation across gene-tree edges. We simulated species networks according to a birth-death-hybridization process, so as to capture a range of realistic species phylogenies. For all 3 methods tested, we found a marked increase in the false discovery of reticulation (type-1 error rate) when there is rate variation across species lineages. The D3 test was the most sensitive, with around 80% type-1 error, such that D3 appears to be more sensitive to a departure from the clock than to the presence of reticulation. For all 3 tests, the power to detect hybridization events decreased as the number of hybridization events increased, indicating that multiple hybridization events can obscure one another if they occur within a small subset of taxa. Our study highlights the need to consider rate variation when using site-based summary statistics, and points to the advantages of methods that do not require assumptions on evolutionary rates across lineages or across genes.
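As an illustration of the kind of site-based summary statistic discussed above, the following is a minimal sketch of the D (ABBA-BABA) statistic for 4 taxa, assuming the standard definition D = (nABBA - nBABA) / (nABBA + nBABA). Without gene flow, the two site patterns are expected to be equally frequent, so D is expected to be near 0. The function name and the toy sequences are hypothetical and are not taken from the study; the significance test (for example by block resampling) is omitted.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Count ABBA and BABA site patterns and return (n_abba, n_baba, D).

    Assumes four aligned sequences of equal length, with the outgroup
    allele treated as ancestral ("A") and the other allele as derived ("B").
    This is an illustrative sketch, not the implementation used in the study.
    """
    n_abba = n_baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        # Keep only biallelic sites made of unambiguous nucleotides.
        if len({a, b, c, o}) != 2 or any(x not in "ACGT" for x in (a, b, c, o)):
            continue
        if a == o and b == c and b != o:
            n_abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            n_baba += 1  # P1 and P3 share the derived allele
    total = n_abba + n_baba
    d = (n_abba - n_baba) / total if total else float("nan")
    return n_abba, n_baba, d


# Toy alignment (hypothetical data): two ABBA sites and one BABA site.
counts = d_statistic("ACAAG", "CACAG", "CCCCG", "AAAAG")
print(counts)
```

Because the statistic depends only on the two pattern counts, any process that makes one pattern more frequent than the other can shift D away from 0. Rate variation across lineages is one such process, which is the type-1 error concern raised in the abstract.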