School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, USA.
Department of Mathematics and IDSS, Massachusetts Institute of Technology, Cambridge, USA.
J Math Biol. 2022 Apr 8;84(5):36. doi: 10.1007/s00285-022-01731-5.
Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene (the gene trees) often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that the molecular sequences associated with each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard but unsatisfactory assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error, or that avoid gene tree estimation altogether. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of [Formula: see text]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we generalize this result beyond that assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform.
As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with [Formula: see text] species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.
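The gene tree discordance mentioned in the abstract can be illustrated with a minimal sketch of the classical three-species case. It assumes a species tree ((A,B),C) with an internal branch of length f in coalescent units, one sampled lineage per species, and constant population sizes, which is the simple setting and not the generalized one treated in the paper. Under those assumptions, the well-known probability that a gene tree matches the species tree topology is 1 - (2/3)e^{-f}. The function names below are hypothetical and chosen for illustration only.

```python
import math
import random


def concordance_probability(f):
    """Classical probability that a three-taxon gene tree matches the
    species tree ((A,B),C), given internal branch length f in coalescent
    units (assumes constant population size, one lineage per species)."""
    return 1.0 - (2.0 / 3.0) * math.exp(-f)


def simulate_gene_tree_topology(f, rng):
    """Draw one gene tree topology under the three-species coalescent.

    Lineages a and b coalesce inside the internal branch if their
    Exp(1) waiting time is shorter than f. Otherwise all three lineages
    reach the root population, where the first pair to coalesce is
    uniform over the three possible pairs.
    """
    if rng.expovariate(1.0) < f:
        return "ab|c"
    return rng.choice(["ab|c", "ac|b", "bc|a"])


def estimate_concordance(f, num_loci, seed=0):
    """Fraction of simulated loci whose gene tree matches the species tree."""
    rng = random.Random(seed)
    hits = sum(
        simulate_gene_tree_topology(f, rng) == "ab|c" for _ in range(num_loci)
    )
    return hits / num_loci


if __name__ == "__main__":
    for f in (0.05, 0.5, 2.0):
        print(f, concordance_probability(f), estimate_concordance(f, 20000))
```

For short internal branches the concordance probability approaches 1/3, so the three topologies become nearly equally frequent. This is one informal way to see why the number of loci needed must grow as f shrinks.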