Department of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma de Mallorca, Spain.
BMC Bioinformatics. 2013 Jan 16;14:3. doi: 10.1186/1471-2105-14-3.
Phylogenetic tree comparison metrics are an important tool in the study of evolution, and hence the definition of such metrics is an interesting problem in phylogenetics. In a paper in Taxon fifty years ago, Sokal and Rohlf proposed to measure quantitatively the difference between a pair of phylogenetic trees by first encoding them by means of their half-matrices of cophenetic values, and then comparing these matrices. This idea has been used several times since then to define dissimilarity measures between phylogenetic trees but, to our knowledge, no proper metric on weighted phylogenetic trees with nested taxa based on this idea has been formally defined and studied yet. Actually, the cophenetic values of pairs of different taxa alone are not enough to single out phylogenetic trees with weighted arcs or nested taxa.
For every (rooted) phylogenetic tree T, let its cophenetic vector φ(T) consist of the cophenetic values of all pairs of taxa in T together with the depths of all taxa in T. It turns out that these cophenetic vectors single out weighted phylogenetic trees with nested taxa. We then define a family of cophenetic metrics d_{φ,p} by comparing these cophenetic vectors by means of L^p norms, and we study, either analytically or numerically, some of their basic properties: neighbors, diameter, distribution, and their rank correlation with each other and with other metrics.
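As an illustration of the definition above, the following is a minimal sketch that builds a cophenetic vector and compares two such vectors with an L^p norm. It assumes, as is standard, that the cophenetic value of a pair of taxa is the depth of their lowest common ancestor, and that depth is the sum of arc weights from the root. The tree encoding (a `parent` dictionary, a `taxa` label-to-node dictionary, an optional `weight` dictionary) and all function names are hypothetical choices made for this example, not taken from the paper. The sketch walks root paths naively, so it is not the O(n^2) procedure the abstract refers to.

```python
from itertools import combinations_with_replacement


def path_to_root(parent, node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def depth(parent, weight, node):
    """Sum of arc weights from the root to `node` (missing weights count as 1)."""
    return sum(weight.get(v, 1.0)
               for v in path_to_root(parent, node)
               if parent[v] is not None)  # the root has no incoming arc


def cophenetic_vector(parent, taxa, weight=None):
    """Map each pair (a, b) of taxon labels, a <= b, to the depth of their LCA.

    parent: node -> parent node (the root maps to None)   [assumed encoding]
    taxa:   label -> node; internal nodes may be labelled (nested taxa)
    weight: node -> weight of the arc entering that node (default 1)
    The pair (a, a) yields the depth of taxon a.
    """
    weight = weight or {}
    vec = {}
    for a, b in combinations_with_replacement(sorted(taxa), 2):
        ancestors_a = set(path_to_root(parent, taxa[a]))
        # First node on b's root path that is also an ancestor of a.
        lca = next(v for v in path_to_root(parent, taxa[b]) if v in ancestors_a)
        vec[(a, b)] = depth(parent, weight, lca)
    return vec


def cophenetic_distance(vec1, vec2, p=1):
    """L^p distance between two cophenetic vectors over the same taxa."""
    if vec1.keys() != vec2.keys():
        raise ValueError("trees must be on the same set of taxa")
    return sum(abs(vec1[k] - vec2[k]) ** p for k in vec1) ** (1.0 / p)


# Two unweighted trees on taxa a, b, c: ((a,b),c) and (a,(b,c)).
taxa = {"a": "a", "b": "b", "c": "c"}
t1 = {"r": None, "x": "r", "a": "x", "b": "x", "c": "r"}
t2 = {"r": None, "y": "r", "a": "r", "b": "y", "c": "y"}
v1 = cophenetic_vector(t1, taxa)
v2 = cophenetic_vector(t2, taxa)
print(cophenetic_distance(v1, v2, p=1))  # 4.0
print(cophenetic_distance(v1, v2, p=2))  # 2.0
```

In this toy case the two vectors differ by 1 in four of their six entries (the depths of a and c, and the cophenetic values of the pairs (a,b) and (b,c)), which gives 4 under the L^1 norm and 2 under the L^2 norm.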
The cophenetic metrics can be safely used on weighted phylogenetic trees with nested taxa and no restriction on degrees, and they can be computed in O(n^2) time, where n stands for the number of taxa. The metrics d_{φ,1} and d_{φ,2} have positively skewed distributions, and they show a low rank correlation with the Robinson-Foulds metric and the nodal metrics, and a very high correlation with each other and with the splitted nodal metrics. The diameter of d_{φ,p}, for p ⩾ 1, is in O(n^{(p+2)/p}), and thus for low p they are more discriminative, having a wider range of values.
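To make the dependence of the diameter bound on p concrete, the exponent can be rewritten and evaluated for a few values of p. This is only arithmetic on the stated bound O(n^{(p+2)/p}); it adds no claim about constants or tightness.

```latex
\[
  \frac{p+2}{p} = 1 + \frac{2}{p}, \qquad p \geq 1,
\]
\[
  p = 1:\ O(n^{3}), \qquad
  p = 2:\ O(n^{2}), \qquad
  p = 4:\ O(n^{3/2}), \qquad
  p \to \infty:\ \text{the exponent tends to } 1 .
\]
```

Since 1 + 2/p decreases as p grows, the bound on the range of values is largest for p = 1 and shrinks toward linear growth in n for large p.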