Comparative Performance of Popular Methods for Hybrid Detection using Genomic Data.

Affiliations

Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210, USA.

Department of Statistics, The Ohio State University, Columbus, OH 43210, USA.

Publication information

Syst Biol. 2021 Aug 11;70(5):891-907. doi: 10.1093/sysbio/syaa092.

DOI:10.1093/sysbio/syaa092

PMID:33404632

Abstract

Interspecific hybridization is an important evolutionary phenomenon that generates genetic variability in a population and fosters species diversity in nature. The availability of large genome-scale data sets has revolutionized hybridization studies, shifting them from the observation of the presence or absence of hybrids to the investigation of the genomic constitution of hybrids and their genome-specific evolutionary dynamics. Although a handful of methods have been proposed in an attempt to identify hybrids, accurate detection of hybridization from genomic data remains a challenging task. In addition to methods that infer phylogenetic networks or that utilize pairwise divergence, site pattern frequency-based and population genetic clustering approaches are popularly used in practice, though the performance of these methods under different hybridization scenarios has not been extensively examined. Here, we use simulated data to comparatively evaluate the performance of four tools that are commonly used to infer hybridization events: the site pattern frequency-based methods HyDe and the $D$-statistic (i.e., the ABBA-BABA test) and the population clustering approaches structure and ADMIXTURE. We consider single hybridization scenarios that vary in the time of hybridization and the amount of incomplete lineage sorting (ILS) for different proportions of parental contributions ($\gamma$); introgressive hybridization; multiple hybridization scenarios; and a mixture of ancestral and recent hybridization scenarios. We focus on the statistical power to detect hybridization and the false discovery rate (FDR) for comparisons of the $D$-statistic and HyDe, and the accuracy of the estimates of $\gamma$ as measured by the mean squared error for HyDe, structure, and ADMIXTURE. Both HyDe and the $D$-statistic are powerful for detecting hybridization in all scenarios except those with high ILS, although the $D$-statistic often has an unacceptably high FDR. The estimates of $\gamma$ in HyDe are impressively robust and accurate, whereas structure and ADMIXTURE sometimes fail to identify hybrids, particularly when the proportional parental contributions are asymmetric (i.e., when $\gamma$ is close to 0). Moreover, the posterior distribution estimated using structure exhibits multimodality in many scenarios, making interpretation difficult. Our results provide guidance in selecting appropriate methods for identifying hybrid populations from genomic data. [ABBA-BABA test; ADMIXTURE; hybridization; HyDe; introgression; Patterson's $D$-statistic; Structure.]
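To make the ABBA-BABA idea concrete, the following is a minimal sketch of how Patterson's $D$ can be computed from site pattern counts. It is not the pipeline used in the study. It assumes one aligned sequence per taxon for a four-taxon tree ((P1, P2), P3), O, and the function name `d_statistic` and the toy sequences are hypothetical. In practice, significance is commonly assessed with a resampling scheme such as a block jackknife, which this sketch omits.

```python
from collections import Counter


def d_statistic(p1, p2, p3, outgroup):
    """Return (D, n_ABBA, n_BABA) for four aligned sequences.

    Assumed topology: ((P1, P2), P3), O. The outgroup allele is
    treated as ancestral ("A"); the other allele is derived ("B").
    """
    if not (len(p1) == len(p2) == len(p3) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")

    valid = set("ACGT")
    counts = Counter()
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        # Skip sites with gaps or ambiguous bases.
        if not {a, b, c, o} <= valid:
            continue
        if a == o and b == c and a != b:
            counts["ABBA"] += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != b:
            counts["BABA"] += 1  # P1 and P3 share the derived allele

    abba, baba = counts["ABBA"], counts["BABA"]
    total = abba + baba
    # D is undefined when no informative sites are present.
    d = (abba - baba) / total if total else float("nan")
    return d, abba, baba


# Toy alignment (hypothetical data): three ABBA sites and one BABA site.
result = d_statistic("ACGAT", "GTAAC", "GTGAC", "ACAAT")
print(result)
```

Without gene flow, ILS is expected to produce ABBA and BABA patterns in equal numbers, so $D$ should be close to 0. A clear excess of either pattern points to allele sharing between P3 and one of the sister taxa.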
