Kalaghatgi Prabhav, Pfeifer Nico, Lengauer Thomas
Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany; Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany.
Mol Biol Evol. 2016 Oct;33(10):2720-34. doi: 10.1093/molbev/msw123. Epub 2016 Jul 19.
A widely used model for evolutionary relationships is a bifurcating tree with all taxa/observations placed at the leaves. This model is not appropriate if the taxa have been densely sampled across evolutionary time and may be in a direct ancestral relationship, or if there is not enough information to fully resolve all the branching points in the evolutionary tree. In this article, we present a fast distance-based agglomeration method called family-joining (FJ) for constructing so-called generally labeled trees, in which taxa may be placed at internal vertices and the tree may contain polytomies. FJ constructs such trees on the basis of pairwise distances and a distance threshold. We tested three methods for threshold selection, FJ-AIC, FJ-BIC, and FJ-CV, which minimize the Akaike information criterion, the Bayesian information criterion, and the cross-validation error, respectively. When compared with related methods on simulated data, FJ-BIC was among the best at reconstructing the correct tree across a wide range of simulation scenarios. FJ-BIC was applied to HIV sequences sampled from individuals involved in a known transmission chain. The FJ-BIC tree was found to be compatible with almost all transmission events. On average, internal branches in the FJ-BIC tree have higher bootstrap support than branches in the leaf-labeled bifurcating tree constructed using RAxML: 36% of the internal branches in the FJ-BIC tree have bootstrap support greater than 70%, compared with 25% in the RAxML tree. To the best of our knowledge, the method presented here is the first attempt at modeling evolutionary relationships using generally labeled trees.
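As a rough illustration of threshold selection by information criterion, the sketch below scores a set of candidate distance thresholds with the standard AIC and BIC formulas and picks the minimizer. The abstract does not say how FJ computes the likelihood or counts parameters for a given threshold, so the function names, the candidate table, and all numbers here are hypothetical placeholders rather than the authors' implementation.

```python
import math


def aic(log_likelihood, n_params):
    # Standard Akaike information criterion: 2k - 2 ln L
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    # Standard Bayesian information criterion: k ln n - 2 ln L
    return n_params * math.log(n_obs) - 2 * log_likelihood


def select_threshold(candidates, n_obs, criterion="bic"):
    """Return the threshold whose fitted tree minimizes the chosen criterion.

    candidates maps a distance threshold to a (log_likelihood, n_params)
    pair for the tree built with that threshold. How those two quantities
    are obtained is an assumption left outside this sketch.
    """
    def score(item):
        _, (log_likelihood, n_params) = item
        if criterion == "bic":
            return bic(log_likelihood, n_params, n_obs)
        return aic(log_likelihood, n_params)

    best_threshold, _ = min(candidates.items(), key=score)
    return best_threshold


# Hypothetical values: a larger threshold collapses more branches, which
# lowers the parameter count but also lowers the fit.
candidates = {
    0.00: (-100.0, 20),
    0.01: (-103.0, 12),
    0.05: (-112.0, 5),
}
n_obs = 45  # hypothetical number of observations (e.g., pairwise distances)

print(select_threshold(candidates, n_obs, criterion="aic"))
print(select_threshold(candidates, n_obs, criterion="bic"))
```

With these made-up numbers the two criteria disagree: BIC penalizes each parameter by ln(n) rather than 2, so it favors the larger threshold and hence the less resolved tree.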