Ben-Dor A, Chor B, Graur D, Ophir R, Pelleg D
Department of Computer Science, Technion, Haifa, Israel.
J Comput Biol. 1998 Fall;5(3):377-90. doi: 10.1089/cmb.1998.5.377.
In this work we present two new approaches for constructing phylogenetic trees. The input is a list of weighted quartets over n taxa. Each quartet is a subtree on four taxa, and its weight represents a confidence level for the specific topology. The goal is to construct a binary tree with n leaves such that the total weight of the satisfied quartets is maximized (an NP hard problem). The first approach we present is based on geometric ideas. Using semidefinite programming, we embed the n points on the n-dimensional unit sphere, while maximizing an objective function. This function depends on Euclidean distances between the four points and reflects the quartet topology. Given the embedding, we construct a binary tree by performing geometric clustering. This process is similar to the traditional neighbor joining, with the difference that the update phase retains geometric meaning: When two neighbors are joined together, their common ancestor is taken to be the center of mass of the original points. The geometric algorithm runs in poly(n) time, but there are no guarantees on the quality of its output. In contrast, our second algorithm is based on dynamic programming, and it is guaranteed to find the optimal tree (with respect to the given quartets). Its running time is a modest exponential, so it can be implemented for modest values of n. We have implemented both algorithms and ran them on real data for n = 15 taxa (14 mammalian orders and an outgroup taxon). The two resulting trees improve previously published trees and seem to be of biological relevance. On this dataset, the geometric algorithm produced a tree whose score is 98.2% of the optimal value on this input set (72.1% vs. 73.4%). This gives rise to the hope that the geometric approach will prove viable even for larger cases where the exponential, dynamic programming approach is no longer feasible.
在这项工作中,我们提出了两种构建系统发育树的新方法。输入是n个分类单元上的加权四重奏列表。每个四重奏是四个分类单元上的子树,其权重代表特定拓扑结构的置信水平。目标是构建一棵有n个叶子的二叉树,使得满足的四重奏的总权重最大化(这是一个NP难问题)。我们提出的第一种方法基于几何思想。使用半定规划,我们将n个点嵌入到n维单位球面上,同时最大化一个目标函数。这个函数取决于四个点之间的欧几里得距离,并反映四重奏拓扑结构。给定嵌入后,我们通过执行几何聚类来构建一棵二叉树。这个过程类似于传统的邻接法,不同之处在于更新阶段保留几何意义:当两个邻居连接在一起时,它们的共同祖先被视为原始点的质心。几何算法的运行时间为多项式时间(poly(n)),但其输出质量没有保证。相比之下,我们的第二种算法基于动态规划,并且保证能找到最优树(相对于给定的四重奏)。它的运行时间是适度指数级的,所以可以针对适度的n值进行实现。我们已经实现了这两种算法,并在n = 15个分类单元(14个哺乳动物目和一个外群分类单元)的真实数据上运行了它们。得到的两棵树改进了之前发表的树,并且似乎具有生物学相关性。在这个数据集上,几何算法生成的一棵树的得分是该输入集上最优值的98.2%(72.1%对73.4%)。这让人希望几何方法即使在指数级的动态规划方法不再可行的更大情况下也能证明是可行的。