The large-sample asymptotic behaviour of quartet-based summary methods for species tree inference.

Author information

School of Mathematics and Statistics / Melbourne Integrative Genomics, The University of Melbourne, Melbourne, 3010, VIC, Australia.

Institut des Sciences de l'Evolution, Université Montpellier, CNRS, EPHE, IRD, Montpellier, 34095, France.

Publication information

J Math Biol. 2022 Aug 17;85(3):22. doi: 10.1007/s00285-022-01786-4.

DOI:10.1007/s00285-022-01786-4

PMID:35976512

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9385842/

Abstract

Summary methods seek to infer a species tree from a set of gene trees. A desirable property of such methods is that of statistical consistency; that is, the probability of inferring the wrong species tree (the error probability) tends to 0 as the number of input gene trees becomes large. A popular paradigm is to infer a species tree that agrees with the maximum number of quartets from the input set of gene trees; this has been proved to be statistically consistent under several models of gene evolution. In this paper, we study the asymptotic behaviour of the error probability of such methods in this limit, and show that it decays exponentially. For a 4-taxon species tree, we derive a closed form for the asymptotic behaviour in terms of the probability that the gene evolution process produces the correct topology. We also derive bounds for the sample complexity (the number of gene trees required to infer the true species tree with a given probability), which outperform existing bounds. We then extend our results to bounds for the asymptotic behaviour of the error probability for any species tree, and compare these to the true error probability for some model species trees using simulations.
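The 4-taxon setting described in the abstract can be illustrated with a small sketch. It is not the paper's derivation: it assumes that each gene tree independently shows the correct quartet topology with probability p and each of the two alternative topologies with probability (1 - p)/2, that the method picks the most frequent topology, and that ties count as errors. The helper `correct_topology_prob` further assumes the multispecies coalescent, under which p = 1 - (2/3)e^{-x} for an internal branch of length x in coalescent units. The function names are hypothetical, and the bound shown is a standard Chernoff plus union bound, not the closed form or the sample-complexity bounds derived in the paper.

```python
import math


def correct_topology_prob(branch_length):
    """Assumed multispecies-coalescent value of p for a quartet whose
    internal branch has the given length (coalescent units)."""
    return 1.0 - (2.0 / 3.0) * math.exp(-branch_length)


def exact_error_prob(n, p):
    """Exact probability that the true topology is NOT the strict plurality
    winner among n independent gene trees (ties counted as errors).
    Requires 0 < p < 1."""
    q = (1.0 - p) / 2.0
    log_p, log_q = math.log(p), math.log(q)
    total = 0.0
    for x1 in range(n + 1):            # gene trees with the true topology
        for x2 in range(n - x1 + 1):   # gene trees with the first wrong topology
            x3 = n - x1 - x2           # gene trees with the second wrong topology
            if x1 <= max(x2, x3):      # error event
                # Multinomial probability, computed in log space for stability.
                log_pmf = (math.lgamma(n + 1) - math.lgamma(x1 + 1)
                           - math.lgamma(x2 + 1) - math.lgamma(x3 + 1)
                           + x1 * log_p + (x2 + x3) * log_q)
                total += math.exp(log_pmf)
    return total


def chernoff_upper_bound(n, p):
    """Standard Chernoff + union bound: error <= 2 * (q + 2*sqrt(p*q))**n,
    valid when p > q. Not the bound derived in the paper."""
    q = (1.0 - p) / 2.0
    return min(1.0, 2.0 * (q + 2.0 * math.sqrt(p * q)) ** n)


if __name__ == "__main__":
    p = correct_topology_prob(1.0)
    for n in (10, 20, 40, 80):
        print(n, exact_error_prob(n, p), chernoff_upper_bound(n, p))
```

Running the script prints, for each number of gene trees, the exact error probability next to the generic bound; under these assumptions both shrink geometrically in n, which is the qualitative behaviour the abstract describes.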
