Susko Edward
Dalhousie University.
Stat Appl Genet Mol Biol. 2011;10:Article 10. doi: 10.2202/1544-6115.1626.
Simulation studies have been the main way of studying the properties of maximum likelihood estimation of evolutionary trees from aligned sequence data. Because trees are unusual parameters and because fitting is computationally intensive, such studies are costly. We develop an asymptotic framework that can be used to obtain probabilities of correct topological reconstruction and to study other properties of likelihood methods when a single split is poorly resolved. Simulations suggest that while approximations to log likelihood differences are better for less well-resolved topologies, approximations to probabilities of correct reconstruction are generally good. We used the approximations to investigate biases in estimation and found that maximum likelihood estimation has a long-branch-repels bias. This is a different form of bias from the long-branch-attracts bias often reported in the literature. For maximum likelihood estimation, long-branch-attracts results usually arise in the presence of model misspecification and are a form of statistical inconsistency in which the estimated tree converges upon an incorrect tree with the long edges together. Here, by bias we mean a tendency to favour a particular topology when data are generated from a four-taxon star tree. While we find a tendency to favour the tree with long branches apart, with more extreme long edges a strong small-sequence-length long-branch-attracts bias overwhelms the long-branch-repels bias. The long-branch-repels bias generalizes to five and six taxa in the sense that subtrees containing taxa that are all distant from the poorly resolved split repel each other.