Department of Statistics, The Ohio State University, Columbus, OH, USA.
Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH, USA.
Syst Biol. 2021 Jan 1;70(1):33-48. doi: 10.1093/sysbio/syaa039.
Numerous methods for inferring species-level phylogenies under the coalescent model have been proposed within the last 20 years, and debates continue about the relative strengths and weaknesses of these methods. One desirable property of a phylogenetic estimator is statistical consistency, which means intuitively that as more data are collected, the probability that the estimated tree has the same topology as the true tree goes to 1. To date, consistency results for species tree inference under the multispecies coalescent (MSC) have been derived only for summary statistics methods, such as ASTRAL and MP-EST. These methods have been found to be consistent given true gene trees but may be inconsistent when gene trees are estimated from data for loci of finite length. Here, we consider the question of statistical consistency for four taxa for SVDQuartets for general data types, as well as for the maximum likelihood (ML) method in the case in which the data are a collection of sites generated under the MSC model such that the sites are conditionally independent given the species tree (we call these data coalescent independent sites [CIS] data). We show that SVDQuartets is statistically consistent for all data types (i.e., for both CIS data and for multilocus data), and we derive its rate of convergence. We additionally show that ML is consistent for CIS data under the JC69 model and discuss why a proof for the more general multilocus case is difficult. Finally, we compare the performance of ML and SVDQuartets using simulation for both data types. [Consistency; gene tree; maximum likelihood; multilocus data; phylogenetic inference; species tree; SVDQuartets.]
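To make the four-taxon setting concrete, the following is a minimal sketch of how a quartet can be scored in the spirit of SVDQuartets. It is not the authors' implementation. It assumes the commonly described formulation in which, for each of the three possible splits, the site pattern frequencies are arranged into a 16 x 16 "flattening" matrix and the split whose matrix is closest to rank 10 (in Frobenius norm) is chosen. The names `flattening`, `svd_score`, and `best_split`, the unnormalized form of the score, and the choice to skip sites with gaps or ambiguity codes are all illustrative assumptions.

```python
import numpy as np

STATES = "ACGT"
INDEX = {s: i for i, s in enumerate(STATES)}

# The three unrooted quartet topologies, written as (row taxa, column taxa)
# using 0-based taxon indices.
SPLITS = {
    "12|34": ((0, 1), (2, 3)),
    "13|24": ((0, 2), (1, 3)),
    "14|23": ((0, 3), (1, 2)),
}


def flattening(seqs, split):
    """Return the 16 x 16 matrix of site pattern frequencies for one split.

    seqs is a list of four aligned sequences (strings of equal length).
    Rows index the states of the first taxon pair, columns the second pair.
    """
    (r1, r2), (c1, c2) = split
    flat = np.zeros((16, 16))
    for site in zip(*seqs):
        try:
            idx = [INDEX[ch] for ch in site]
        except KeyError:
            continue  # assumption: skip sites with gaps or ambiguity codes
        flat[4 * idx[r1] + idx[r2], 4 * idx[c1] + idx[c2]] += 1
    total = flat.sum()
    return flat / total if total else flat


def svd_score(flat, rank=10):
    """Frobenius distance from flat to the nearest matrix of the given rank."""
    sv = np.linalg.svd(flat, compute_uv=False)  # sorted in descending order
    return float(np.sqrt(np.sum(sv[rank:] ** 2)))


def best_split(seqs):
    """Score all three splits and return the name of the lowest-scoring one."""
    scores = {name: svd_score(flattening(seqs, split))
              for name, split in SPLITS.items()}
    return min(scores, key=scores.get), scores
```

Calling `best_split` on four aligned sequences returns the preferred split together with all three scores. Statistical consistency, in this sketch's terms, is the statement that the probability of `best_split` returning the true split approaches 1 as the number of sites grows.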